Abstract. We present a technique for efficient stateless model checking of programs
that execute under the relaxed memory models TSO and PSO. The basis for
our technique is a novel representation of executions under TSO and PSO, called
chronological traces. Chronological traces induce a partial order relation on relaxed
memory executions, capturing dependencies that are needed to represent
the interaction via shared variables. They are optimal in the sense that they only
distinguish computations that are inequivalent under the widely-used representation
by Shasha and Snir. This allows an optimal dynamic partial order reduction
algorithm to explore a minimal number of executions while still guaranteeing full
coverage. We apply our techniques to check, under the TSO and PSO memory
models, LLVM assembly produced for C/pthreads programs. Our experiments
show that our technique reduces the verification effort for relaxed memory models
to be almost that for the standard model of sequential consistency. In many
cases, our implementation significantly outperforms other comparable tools.


1 Introduction 

Verification and testing of concurrent programs is difficult, since one must consider all 
the different ways in which instructions of different threads can be interleaved. To make 
matters worse, most architectures implement relaxed memory models, such as TSO and 
PSO [31,3], which make threads interact in even more and subtler ways than by standard
interleaving. For example, a processor may reorder loads and stores by the same thread 
if they target different addresses, or it may buffer stores in a local queue. 

A successful technique for finding concurrency bugs (i.e., defects that arise only under
some thread schedulings), and for verifying their absence, is stateless model checking
(SMC) [15], also known as systematic concurrency testing [20,34]. Starting from a
test, i.e., a way to run a program and obtain some expected result, which is terminating
and threadwisely deterministic (e.g. no data-nondeterminism), SMC systematically
explores the set of all thread schedulings that are possible during runs of this test. A
special runtime scheduler drives the SMC exploration by making decisions on scheduling
whenever such decisions may affect the interaction between threads, so that the
exploration covers all possible executions and detects any unexpected test results, program
crashes, or assertion violations. The technique is completely automatic, has no
false positives, does not suffer from memory explosion, and can easily reproduce the
concurrency bugs it detects. SMC has been successfully implemented in tools such as
VeriSoft [16], CHESS [24], and Concuerror [11].

There are two main problems for using SMC in programs that run under relaxed
memory models (RMM). The first problem is that already under the standard model of
sequential consistency (SC) the number of possible thread schedulings grows exponentially
with the length of program execution. This problem has been addressed by partial
order reduction (POR) techniques that achieve coverage of all thread schedulings, by
exploring only a representative subset [33,26,14,12]. POR has been adapted to SMC
in the form of Dynamic Partial Order Reduction (DPOR) [13], which has been further
developed in recent years [28,20,18,27,32,1]. DPOR is based on augmenting each execution
by a happens-before relation, which is a partial order that captures dependencies
between operations of the threads. Two executions can be regarded as equivalent if they
induce the same happens-before relation, and it is therefore sufficient to explore one
execution in each equivalence class (called a Mazurkiewicz trace [23]). DPOR algorithms
guarantee to explore at least one execution in each equivalence class, thus attaining
full coverage with reduced cost. A recent optimal algorithm [1] guarantees to explore
exactly one execution per equivalence class.

The second problem is that in order to extend SMC to handle relaxed memory models,
the operational semantics of programs must be extended to represent the effects
of RMM. The natural approach is to augment the program state with additional structures,
e.g., store buffers in the case of TSO, that model the effects of RMM [2,4,25].
This causes blow-ups in the number of possible executions, in addition to those possible
under SC. However, most of these additional executions are equivalent to some
SC execution. To efficiently apply SMC to handle RMM, we must therefore extend
DPOR to avoid redundant exploration of equivalent executions. The natural definition
of “equivalent” under RMM can be derived from the abstract representation of executions
due to Shasha and Snir [30], here called Shasha-Snir traces, which is often used
in model checking and runtime verification [17,19,9,10,6,7]. Shasha-Snir traces consist
of an ordering relation between dependent operations, which generalizes the standard
happens-before relation on SC executions; indeed, under SC, the equivalence relation
induced by Shasha-Snir traces coincides with Mazurkiewicz traces. It would thus be
natural to base DPOR for RMM on the happens-before relation induced by Shasha-Snir
traces. However, this relation is in general cyclic (due to reorderings possible under
RMM) and can therefore not be used as a basis for DPOR (since it is not a partial order).
To develop an efficient technique for SMC under RMM we therefore need to find
a different representation of executions under RMM. The representation should define
an acyclic happens-before relation. Also, the induced trace equivalence should coincide
with the equivalence induced by Shasha-Snir traces.

Contribution In this paper, we show how to apply SMC to TSO and PSO in a way
that achieves maximal possible reduction using DPOR, in the sense that redundant
exploration of equivalent executions is avoided. A cornerstone in our contribution is a
novel representation of executions under RMM, called chronological traces, which define
a happens-before relation on the events in a carefully designed representation of
program executions. Chronological traces are a succinct canonical representation of
executions, in the sense that there is a one-to-one correspondence between chronological
traces and Shasha-Snir traces. Furthermore, the happens-before relation induced by
chronological traces is a partial order, and can therefore be used as a basis for DPOR.
In particular, the Optimal-DPOR algorithm of [1] will explore exactly one execution
per Shasha-Snir trace. Consequently, for so-called robust programs that are not affected
by RMM (these include data-race-free programs), Optimal-DPOR will explore as many
executions under RMM as under SC: this follows from the one-to-one correspondence
between chronological traces and Mazurkiewicz traces under SC. Furthermore, robustness
can itself be considered a correctness criterion, which can also be automatically
checked with our method (by checking whether the number of equivalence classes is
increased when going from SC to RMM).

We show the power of our technique by using it to implement an efficient stateless
model checker, which for C programs with pthreads explores all executions of a test case
or a program, up to some bounded length. During exploration of an execution, our
implementation generates the corresponding chronological trace. Our implementation
employs the source-DPOR algorithm [1], which is simpler than Optimal-DPOR, but
about equally effective. Our experimental results for analyses under SC, TSO and PSO
of a number of intensely racy benchmarks and programs written in C/pthreads show that
(i) the effort for verification under TSO and PSO is not much larger than the effort for
verification under SC, and (ii) our implementation compares favourably against CBMC,
a state-of-the-art bounded model checking tool, showing the potential of our approach.


2 Overview of Main Concepts 


This section informally motivates and explains the main concepts of the paper. To focus
the presentation, we consider mainly the TSO model. TSO is relevant because it is
implemented in the widely used x86 as well as SPARC architectures. We first introduce
TSO and its semantics. Thereafter we introduce Shasha-Snir traces, which abstractly
represent the orderings between dependent events in an execution. Since Shasha-Snir
traces are cyclic, we introduce an extended representation of executions, for which a
natural happens-before relation is acyclic. We then describe how this happens-before
relation introduces undesirable distinctions between executions, and how our new
representation of chronological traces removes these distinctions. Finally, we illustrate how
a DPOR algorithm exploits the happens-before relation induced by chronological traces
to explore only a minimal number of executions, while still guaranteeing full coverage.

TSO: an Introduction TSO relaxes the ordering between stores and subsequent loads
to different memory locations. This can be explained operationally by equipping each
thread with a store buffer [29], which is a FIFO queue that contains pending store
operations. When a thread executes a store instruction, the store does not immediately
affect memory. Instead it is delayed and enqueued in the store buffer. Nondeterministically,
at some later point an update event occurs, dequeueing the oldest store from the
store buffer and updating the memory correspondingly. Load instructions take effect
immediately, without being delayed. Usually a load reads a value from memory. However,
if the store buffer of the same thread contains a store to the same memory location,
the value is instead taken from the store in the store buffer.

    p                    q
    store: x:=1          store: y:=1
    load: $r:=y          load: $s:=x

Fig. 1: A program implementing the classic idiom of Dekker’s mutual exclusion algorithm.


To see why this buffering semantics may cause unexpected program behaviors,
consider the small program in Fig. 1. It consists of two threads p and q. The
thread p first stores 1 to the memory location x, and then loads the value at memory
location y into its register $r. The thread q is similar. All memory locations and
registers are assumed to have initial values 0.

    p: store: x:=1    // Enqueue store
    p: load: $r:=y    // Load value 0
    q: store: y:=1    // Enqueue store
    q: update         // y = 1 in memory
    q: load: $s:=x    // Load value 0
    p: update         // x = 1 in memory

Fig. 2: An execution of the program in Fig. 1. Notice that $r = $s = 0 at the end.

It is easy to see that under the SC semantics, it is impossible for the program to terminate
in a state where both registers $r and $s hold the value 0. However, under the
buffering semantics of TSO, such a final state is possible. Fig. 2 shows one such program
execution. We see that the store to x happens at the beginning of the execution, but
does not take effect with respect to memory until the very end of the execution. Thus
the store to x and the load of y appear to take effect in an order opposite to how they
occur in the program code. This allows the execution to terminate with $r = $s = 0.
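
The claim above can be checked by brute force. The following Python sketch is not part of the tool described in this paper; it is a minimal illustration that assumes a toy encoding of the program of Fig. 1 (the names PROGRAM and explore are hypothetical). It enumerates every schedule of instruction steps and update events, once with store buffers (TSO) and once with stores written directly to memory (SC), and collects the final register values.

```python
# Toy encoding of Fig. 1 (an assumption of this sketch, not the paper's syntax).
# ("store", location, value) or ("load", register, location)
PROGRAM = {
    "p": [("store", "x", 1), ("load", "r", "y")],
    "q": [("store", "y", 1), ("load", "s", "x")],
}

def explore(buffered):
    """Return the set of final (r, s) values over all schedules.

    buffered=True models TSO store buffers; buffered=False writes each store
    straight to memory, which models SC.
    """
    outcomes = set()
    start = ({"x": 0, "y": 0}, {"p": 0, "q": 0}, {"p": (), "q": ()}, {})
    stack = [start]
    while stack:
        mem, pc, buf, regs = stack.pop()
        moved = False
        for t in PROGRAM:
            # Instruction step of thread t.
            if pc[t] < len(PROGRAM[t]):
                moved = True
                ins = PROGRAM[t][pc[t]]
                mem2, buf2, regs2 = dict(mem), dict(buf), dict(regs)
                if ins[0] == "store":
                    _, loc, val = ins
                    if buffered:
                        buf2[t] = buf[t] + ((loc, val),)   # enqueue, memory untouched
                    else:
                        mem2[loc] = val
                else:
                    _, reg, loc = ins
                    pending = [v for (l, v) in buf[t] if l == loc]
                    # Buffer forwarding if the own buffer holds a store to loc.
                    regs2[reg] = pending[-1] if pending else mem[loc]
                stack.append((mem2, {**pc, t: pc[t] + 1}, buf2, regs2))
            # Update event for thread t: flush the oldest buffered store.
            if buf[t]:
                moved = True
                (loc, val), rest = buf[t][0], buf[t][1:]
                stack.append(({**mem, loc: val}, pc, {**buf, t: rest}, regs))
        if not moved:   # both threads finished and both buffers empty
            outcomes.add((regs["r"], regs["s"]))
    return outcomes

sc = explore(buffered=False)
tso = explore(buffered=True)
print("SC outcomes: ", sorted(sc))
print("TSO outcomes:", sorted(tso))
```

Under these assumptions the outcome $r = $s = 0 appears only in the TSO set, matching the execution of Fig. 2.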


Shasha-Snir Traces for TSO Partial order reduction is based on the idea of capturing 
the possible orderings between dependent operations of different threads by means of a 
happens-before relation. When threads interact via shared variables, two instructions are 
considered dependent if they access the same global variable, and at least one is a write. 
For relaxed memory models, Shasha and Snir [30] introduced an abstract representation 
of executions, here referred to as Shasha-Snir traces, which captures such dependencies
in a natural way. Shasha-Snir traces induce equivalence classes of executions. Under 
sequential consistency, those classes coincide with the Mazurkiewicz traces. Under a 
relaxed memory model, there are also additional Shasha-Snir traces corresponding to 
the non-sequentially consistent executions. 

A Shasha-Snir trace is a directed graph, where edges capture observed event orderings.
The nodes in a Shasha-Snir trace are the executed instructions. For each thread, there
are edges between each pair of subsequent instructions, creating a total order for each
thread. For two instructions i and j in different threads, there is an edge \(i \to j\) in a
trace when i causally precedes j. This happens when j reads a value that was written by i,
when i reads a memory location that is subsequently updated by j, or when i and j are
subsequent writes to the same memory location. In Fig. 3 we show the Shasha-Snir trace
for the execution in Fig. 2.

[Figure: graph over the nodes p: store: x:=1, p: load: $r:=y, q: store: y:=1 and q: load: $s:=x]

Fig. 3: The Shasha-Snir trace corresponding to the execution in Fig. 2.


Making the Happens-Before Relation Acyclic Shasha-Snir traces naturally represent
the dependencies between operations in an execution, and are therefore a natural basis
for applying DPOR. However, a major problem is that the happens-before relation induced
by the edges is in general cyclic, and thus not a partial order. This can be seen
already in the graph in Fig. 3. This problem can be addressed by adding nodes that
represent explicit update events. That would be natural since such events occur in the
representation of the execution in Fig. 2. When we consider the edges of the Shasha-Snir
trace, we observe that although there is a conflict between p: load: $r:=y and
q: store: y:=1, swapping their order in the execution in Fig. 2 has no observable effect;
the load still gets the same value from memory. Therefore, we should only be
concerned with the order of the load relative to the update event q: update.

These observations suggest defining a representation of traces that separates
stores from updates. In Fig. 4 we have redrawn the trace from Fig. 3. Updates are
separated from stores, and we order updates, rather than stores, with operations
of other threads. Thus, there are edges between updates to and loads from the
same memory location, and between two updates to the same memory location. In
Fig. 4, there is an edge from each store to the corresponding update, reflecting the
principle that the update cannot occur before the store. There are edges between loads and
updates of the same memory location, reflecting that swapping their order will affect
the observed values. However, notice that for this program there are no edges between
the updates and loads of the same thread, since they access different memory locations.
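
The difference between the two representations can be checked mechanically. The Python sketch below encodes both graphs for the execution of Fig. 2 as adjacency lists and runs a standard depth-first cycle check. The edge lists are an assumption of this sketch: they are written down from the prose description above (program order, store-to-update, and load-to-update edges), not copied from the figures, and the node names are hypothetical.

```python
# Shasha-Snir trace of the execution in Fig. 2 (edges as described in the text).
SHASHA_SNIR = {
    "p:store x": ["p:load y"],   # program order in p
    "q:store y": ["q:load x"],   # program order in q
    "p:load y": ["q:store y"],   # p read y = 0, later overwritten by q's store
    "q:load x": ["p:store x"],   # q read x = 0, later overwritten by p's store
}

# The same execution with updates separated from stores.
WITH_UPDATES = {
    "p:store x": ["p:load y", "p:update x"],   # program order, store-to-update
    "q:store y": ["q:load x", "q:update y"],
    "p:load y": ["q:update y"],                # the load precedes q's update of y
    "q:load x": ["p:update x"],                # the load precedes p's update of x
    "p:update x": [],
    "q:update y": [],
}

def has_cycle(graph):
    """Depth-first search with white/grey/black colouring."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for succ in graph[node]:
            if colour[succ] == GREY:
                return True              # back edge: a cycle exists
            if colour[succ] == WHITE and visit(succ):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in graph)

print("Shasha-Snir trace cyclic:  ", has_cycle(SHASHA_SNIR))
print("Separated updates cyclic:  ", has_cycle(WITH_UPDATES))
```

With these edge lists the first graph contains the cycle through all four instructions, while the second is acyclic and therefore usable as a happens-before relation.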

Chronological Traces for TSO Although the new representation is a valid partial 
order, it will in many cases distinguish executions that are semantically equivalent according
to the Shasha-Snir traces. The reason for this is the mechanism of TSO buffer
forwarding: When a thread executes a load to a memory location x, it will first check 
its store buffer. If the buffer contains a store to x, then the load returns the value of the 
newest such store buffer entry instead of loading the value from memory. This causes 
difficulties for a happens-before relation that orders any update with any load of the 
same memory location. 

For example, consider the program shown in Fig. 5. Any execution of this program
will have two updates and one load to x. Those accesses can be permuted in
six different ways. Fig. 6(a), 6(b) and 6(c) show three of the corresponding
happens-before relations. In Fig. 6(a) and 6(b) the load is satisfied by buffer forwarding,
and in 6(c) by a read from memory. These three relations all correspond to the same
Shasha-Snir trace, shown in Fig. 7(a), and they all have the same observable behavior,
since the value of the load is obtained from the same store. Hence, we should find a
representation of executions that does not distinguish between these three cases.

Fig. 5: A program illustrating buffer forwarding.

We can now describe chronological traces, our representation which solves the
above problems, by omitting some of the edges, leaving some nodes unrelated. More 
precisely, edges between loads and updates should be omitted in the following cases. 

1. A load is never related to an update originating in the same thread. This captures 
the intuition that swapping the order of such a load and update has no effect other 
than changing a load from memory into a load of the same value from buffer, as 
seen when comparing Fig. 6(b) and 6(c). 

2. A load ld from a memory location x by a thread p is never related to an update by
another thread q, if the update by q precedes some update to x originating in a
store by p that precedes ld. This is because the value written by the update of q is
effectively hidden to the load ld by the update to x by p. Thus, when we compare
Fig. 6(a) and 6(b), we see that the order between the update by q and the load is
irrelevant, since the update by q is hidden by the update by p (note that the update
by p originates in a store that precedes the load).

Fig. 4: A trace for the execution in Fig. 2 where updates are separated from stores.

Fig. 6: Three redundant happens-before relations for Fig. 5.


    p                         q
    store: x:=1    <-    store: x:=2
    load: $r:=x

(a) A Shasha-Snir trace corresponding to all three traces of Fig. 6.

[Figure: graph over the nodes store: x:=1, load: $r:=x, store: x:=2 and the two update events]

(b) The three traces can be merged into this single trace.

Fig. 7: Traces that capture all three of Fig. 6(a), 6(b) and 6(c).


When we apply these rules to the example of Fig. 5, all of the three representations 
in Fig. 6(a), 6(b), and 6(c) merge into a single representation shown in Fig. 7(b). In total, 
we reduce the number of distinguished cases for the program from six to three. This is 
indeed the minimal number of cases that must be distinguished by any representation, 
since the different cases result in different values being loaded by the load instruction 
or different values in memory at the end of the execution. Our proposed representation 
is optimal for the programs in Fig. 1 and 5. In Theorem 1 of Section 3 we will show 
that such an optimality result holds in general. 
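
The count of three can be reproduced with a short enumeration. The Python sketch below is an illustration only. It assumes the program of Fig. 5 consists of p: store: x:=1 followed by p: load: $r:=x, and q: store: x:=2 (as read off Fig. 7(a)), and it assumes both stores have already been enqueued, so that only the two updates and the load remain to be ordered. The function name run is hypothetical.

```python
from itertools import permutations

def run(order):
    """Replay one ordering of the three accesses to x.

    Returns the pair (value loaded into $r, final value of x in memory).
    """
    mem_x = 0
    p_buffered = True        # p's store x:=1 is still in p's store buffer
    loaded = None
    for event in order:
        if event == "p.update":
            mem_x, p_buffered = 1, False
        elif event == "q.update":
            mem_x = 2
        else:                # "p.load"
            # Buffer forwarding while p's store is pending, else read memory.
            loaded = 1 if p_buffered else mem_x
    return loaded, mem_x

orders = list(permutations(["p.update", "q.update", "p.load"]))
outcomes = {run(order) for order in orders}

for order in orders:
    print(order, "->", run(order))
print(len(orders), "orderings,", len(outcomes), "observable outcomes")
```

The six orderings collapse to three distinct pairs of loaded value and final memory contents, which is the lower bound on the number of cases any representation must distinguish for this program.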

Chronological Traces for PSO The TSO and PSO memory models are very similar.
The difference is that PSO does not enforce program order between stores by the same
thread to different memory locations. To capture this, chronological traces are constructed
differently under TSO and PSO. In particular, under TSO there will always be
edges between all updates of the same thread, but under PSO we omit those edges when
the updates access different memory locations. In Appendix C we describe in detail how
to adapt the chronological traces described above to the PSO memory model.

DPOR Based on Chronological Traces Here, we illustrate how stateless model checking
performs DPOR based on chronological traces, in order to explore one execution
per chronological trace. As an example, we use the small program of Fig. 5.

The algorithm initially explores an arbitrary execution of the program, and simultaneously
generates the corresponding chronological trace. In our example, this execution
can be the one shown in Fig. 8(a), along with its chronological trace. The algorithm
then finds those edges of the chronological trace that can be reversed by changing the
thread scheduling of the execution. In Fig. 8(a), the reversible edges are the ones from
p: update to q: update, and from p: load: $r:=x to q: update. For each such edge,
the program is executed with this edge reversed. Reversing an edge can potentially lead
to a completely different continuation of the execution, which must then be explored.

In the example, reversing the edge from p : load: $r: =x to q : update will generate 
the execution and chronological trace in Fig. 8(b). Notice that the new execution is 
observably different from the previous one: the load reads the value 2 instead of 1. 

The chronological traces in both Fig. 8(a) and 8(b) display a reversible edge from 
p : update to q : update. The algorithm therefore initiates an execution where q : 
update is performed before p : update. The algorithm will generate the execution and 
chronological trace in Fig. 8(c). 

Notice that the only reversible edge in Fig. 8(c) is the one from q : update to 
p : update. However, executing p : update before q : update has already been explored 
in Fig. 8(a) and Fig. 8(b). Since there are no more edges that can be reversed, SMC 
terminates, having examined precisely the three chronological traces that exist for the 
program of Fig. 5. 
Fig. 8: How SMC with DPOR explores the program of Fig. 5.













3 Formalization 


In this section we summarize our formalization of the concepts of Section 2. We introduce
our representation of program executions, define chronological traces, formalize
Shasha-Snir traces for TSO, and prove a one-to-one correspondence between chronological
traces and Shasha-Snir traces. The formalization is self-contained, but for lack
of space, we sometimes use precise English rather than formal notation. A more fully
formalized version can be found in Appendix A.

Parallel Programs We consider parallel programs consisting of a number of threads
that run in parallel, each executing a deterministic code, written in an assembly-like
programming language. The language includes instructions store: x:=$r, load: $r:=x,
and fence. Other instructions do not access memory, and their precise syntax and semantics
are ignored for brevity. Here, and in the remainder of this text, x, y, z are used
to name memory locations, u, v, w are used to name values, and $r, $s, $t are used to
name processor registers. We use TID to denote the set of all thread identifiers.

Formal TSO Semantics We formalize the TSO model by an operational semantics. 
Define a configuration as a pair (L, M), where M maps memory locations to values, 
and L maps each thread p to a local configuration of the form L(p) = (R, B), where R 
is the state of local registers and program counter of p, and B is the contents of the store 
buffer of p. This content is a word over pairs (x, v) of memory locations and values. 
We let the notation B(x) denote the value v such that (x, v) is the rightmost pair in B of 
form (x, _). If there is no such pair in B, then \(B(x) = \bot\).

In order to accommodate memory updates in our operational semantics, we assume
that for each thread \(p \in \mathrm{TID}\), there is an auxiliary thread upd(p), which
nondeterministically performs memory updates from the store buffer of p. We use
\(\mathrm{AuxTID} = \{\mathrm{upd}(p) \mid p \in \mathrm{TID}\}\) to denote the set of
auxiliary thread identifiers. We use p and q to refer to real or auxiliary threads in
\(\mathrm{TID} \cup \mathrm{AuxTID}\) as convenient.

For configurations \(c = (L, M)\) and \(c' = (L', M')\), we write \(c \xrightarrow{p} c'\)
to denote that from configuration c, thread p can execute its next instruction, thereby
changing the configuration into c'. Let \(L(p) = (R, B)\), and let \(R_{pc}\) be obtained
from R by advancing the program counter after p executes its next instruction. Depending
on this next instruction op, we have the following cases.

Store: If op has the form store: x:=$r, then \(c \xrightarrow{p} c'\) iff \(M' = M\) and
\(L' = L[p \mapsto (R_{pc}, B \cdot (x, v))]\) where \(v = R(\$r)\), i.e., instead of
updating the memory, we insert the entry (x, v) at the end of the store buffer of the thread.

Load: If op has the form load: $r:=x, then \(c \xrightarrow{p} c'\) iff \(M' = M\) and either

1. (From memory) \(B(x) = \bot\) and \(L' = L[p \mapsto (R_{pc}[\$r \mapsto M(x)], B)]\),
i.e., there is no entry for x in the thread’s own store buffer, so the value is read from
memory, or

2. (Buffer forwarding) \(B(x) \neq \bot\) and \(L' = L[p \mapsto (R_{pc}[\$r \mapsto B(x)], B)]\),
i.e., p reads the value of x from its latest entry in its store buffer.


Fence: If op has the form fence, then \(c \xrightarrow{p} c'\) iff \(B = \varepsilon\) and
\(M' = M\) and \(L' = L[p \mapsto (R_{pc}, B)]\). A fence can only be executed when the
store buffer of the thread is empty.


Update: In addition to instructions which are executed by the threads, at any point
when a store buffer is non-empty, an update event may nondeterministically occur. The
memory is then updated according to the oldest (leftmost) letter in the store buffer,
and that letter is removed from the buffer. To formalize this, we will assume that the
auxiliary thread upd(p) executes a pseudo-instruction u(x). We then say that
\(c \xrightarrow{\mathrm{upd}(p)} c'\) iff \(B = (x, v) \cdot B'\) for some x, v, B' and
\(M' = M[x \mapsto v]\) and \(L' = L[p \mapsto (R, B')]\).
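
The four rules translate directly into a successor function. The following Python sketch is a minimal illustration under stated assumptions, not the implementation described in Section 4: configurations are plain dictionaries, None stands in for \(\bot\), instructions are tuples, and the names buffer_lookup and step are hypothetical. A transition that is not enabled returns None.

```python
import copy

def buffer_lookup(buf, x):
    """B(x): value of the rightmost entry for x in the buffer, or None."""
    for loc, val in reversed(buf):
        if loc == x:
            return val
    return None

def step(cfg, prog, t, update=False):
    """Successor of cfg when thread t moves (or upd(t) if update=True).

    Returns None when the requested transition is not enabled.
    """
    new = copy.deepcopy(cfg)                 # leave the old configuration intact
    regs, buf, mem = new["regs"][t], new["buf"][t], new["mem"]
    if update:                               # Update rule
        if not buf:
            return None
        x, v = buf.pop(0)                    # oldest (leftmost) entry
        mem[x] = v
        return new
    if new["pc"][t] >= len(prog[t]):
        return None
    op = prog[t][new["pc"][t]]
    if op[0] == "store":                     # Store rule: enqueue (x, R($r))
        _, x, r = op
        buf.append((x, regs[r]))
    elif op[0] == "load":                    # Load rule
        _, r, x = op
        forwarded = buffer_lookup(buf, x)
        regs[r] = mem[x] if forwarded is None else forwarded
    elif op[0] == "fence":                   # Fence rule: requires an empty buffer
        if buf:
            return None
    new["pc"][t] += 1                        # advance the program counter
    return new

# A single thread: store x:=$a; load $r:=x; fence
prog = {"p": [("store", "x", "a"), ("load", "r", "x"), ("fence",)]}
c0 = {"pc": {"p": 0}, "regs": {"p": {"a": 1, "r": 0}},
      "buf": {"p": []}, "mem": {"x": 0}}
c1 = step(c0, prog, "p")                     # store is buffered
c2 = step(c1, prog, "p")                     # load served by buffer forwarding
blocked = step(c2, prog, "p")                # fence not enabled yet
c3 = step(c2, prog, "p", update=True)        # upd(p) flushes the store
c4 = step(c3, prog, "p")                     # fence now enabled
```

In this run the load obtains 1 while memory still holds 0, and the fence becomes executable only after the update has emptied the buffer.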

Program Executions A program execution is a sequence
\(c_0 \xrightarrow{p_1} c_1 \xrightarrow{p_2} \cdots \xrightarrow{p_n} c_n\)
of configurations related by transitions labelled by actual or auxiliary thread IDs. Since
each transition of each program thread (including the auxiliary threads of form upd(q))
is deterministic, a program run is uniquely determined by its sequence of thread IDs. We
will therefore define an execution as a word of events. Each event represents a transition
in the execution as a triple (p, i, j), where p is a regular or auxiliary thread executing an
instruction i (which can possibly be an update), and the natural number j is such that
the event is the j-th event of p in the execution.

Chronological Traces We can now introduce the main conceptual contribution of the
paper, viz. chronological traces. For an execution \(\tau\) we define its chronological trace
\(T_C(\tau)\) as a directed graph (V, E). The vertices V are all the events in \(\tau\) (both
events representing instructions and events representing updates). The edges are the union of
six relations: \(E = \to^{\mathrm{po}}_{\tau} \cup \to^{\mathrm{su}}_{\tau} \cup \to^{\mathrm{uu}}_{\tau} \cup \to^{\mathrm{src\text{-}ct}}_{\tau} \cup \to^{\mathrm{cf\text{-}ct}}_{\tau} \cup \to^{\mathrm{uf}}_{\tau}\).
These edge relations are defined as follows, for two arbitrary events
\(e = (p, i, j)\), \(e' = (p', i', j') \in V\):

Program Order: \(e \to^{\mathrm{po}}_{\tau} e'\) iff \(p = p'\) and \(j' = j + 1\), i.e., e and e'
are consecutive events of the same thread.

Store to Update: \(e \to^{\mathrm{su}}_{\tau} e'\) iff e' is the update event corresponding to
the store e.

Update to Update: \(e \to^{\mathrm{uu}}_{\tau} e'\) iff \(i = u(x)\) and \(i' = u(x)\) for some x,
and e and e' are consecutive updates to the memory location x.

Source: \(e \to^{\mathrm{src\text{-}ct}}_{\tau} e'\) iff e' is a load which reads the value of the
update event e, which is from a different process. Notice that this definition excludes the
possibility of \(p = \mathrm{upd}(p')\); a load is never src-related to an update from the same
thread.

Conflict: \(e \to^{\mathrm{cf\text{-}ct}}_{\tau} e'\) iff e' is the update that overwrites the
value read by e.

Update to Fence: \(e \to^{\mathrm{uf}}_{\tau} e'\) iff \(i = u(x)\) for some x, and i' = fence and
\(p = \mathrm{upd}(p')\) and e is the latest update by p which occurs before e' in \(\tau\). The
intuition here is that the fence cannot be executed until all pending updates of the same thread
have been flushed from the buffer. Hence the updates are ordered before the fence, and the
chronological trace has an edge from the last of these updates to the fence event.
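
To make the definitions concrete, the following Python sketch computes the chronological trace of every execution of a small program and counts the distinct traces. It is a simplified reading of the definitions that covers only what this example needs: the program is assumed to be p: store: x:=1; load: $r:=x and q: store: x:=2 (the program of Fig. 5 as read off Fig. 7(a)), there is a single memory location and no fence, and a conflict edge is added only to an update of another thread, following rule 1 of Section 2. All names are hypothetical.

```python
from itertools import permutations

EVENTS = ["p.store", "p.load", "p.update", "q.store", "q.update"]
# Orderings forced by program order and by store-before-update.
MUST_PRECEDE = [("p.store", "p.load"), ("p.store", "p.update"),
                ("q.store", "q.update")]

def executions():
    """All event orders that respect MUST_PRECEDE."""
    for perm in permutations(EVENTS):
        pos = {e: k for k, e in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in MUST_PRECEDE):
            yield perm

def chronological_trace(execution):
    """Edge set of the chronological trace, as (relation, source, target)."""
    edges = {("po", "p.store", "p.load"),
             ("su", "p.store", "p.update"),
             ("su", "q.store", "q.update")}
    updates = [e for e in execution if e.endswith(".update")]
    edges.add(("uu", updates[0], updates[1]))   # consecutive updates to x

    # Identify the update whose value the load observes.
    done = []
    for e in execution:
        if e == "p.load":
            if "p.update" not in done:
                source = "p.update"    # buffer forwarding from p's own store
            else:
                source = done[-1]      # read from memory: latest update so far
            break
        if e.endswith(".update"):
            done.append(e)

    if source == "q.update":           # src-ct only for another thread's update
        edges.add(("src-ct", "q.update", "p.load"))

    # cf-ct: the update that directly follows the source overwrites the value.
    k = updates.index(source)
    if k + 1 < len(updates) and updates[k + 1] != "p.update":
        edges.add(("cf-ct", "p.load", updates[k + 1]))
    return frozenset(edges)

all_executions = list(executions())
traces = {chronological_trace(e) for e in all_executions}
print(len(all_executions), "executions,", len(traces), "chronological traces")
```

Under these assumptions the executions of the program fall into three chronological traces, in line with the count given in Section 2.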

Shasha-Snir Traces We will now formalize Shasha-Snir traces, and prove that chronological
traces are equivalent to Shasha-Snir traces, in the sense that they induce the same
equivalence relation on executions. We first recall the definition of Shasha-Snir traces.
We follow the formalization by Bouajjani et al. [7].

First, we introduce the notion of a completed execution. We say that an execution
\(\tau\) is completed when all stores have reached memory by means of a corresponding update
event. In the context of Shasha-Snir traces, we will restrict ourselves to completed
executions.



For a completed execution \(\tau\), we define the Shasha-Snir trace of \(\tau\) as the graph
\(T(\tau) = (V, E)\) where V is the set of all non-update events (p, i, j) in \(\tau\) (i.e.,
\(i \neq u(x)\) for all x). The edge set E is the union of four relations
\(E = \to^{\mathrm{po}}_{\tau} \cup \to^{\mathrm{st}}_{\tau} \cup \to^{\mathrm{src\text{-}ss}}_{\tau} \cup \to^{\mathrm{cf\text{-}ss}}_{\tau}\),
where \(\to^{\mathrm{po}}_{\tau}\) (program order) is the same as for chronological traces, and
where, letting \(e = (p, i, j)\) and \(e' = (p', i', j')\):

Store Order: \(e \to^{\mathrm{st}}_{\tau} e'\) iff i and i' are two stores, whose corresponding
updates are consecutive updates to the same memory location. I.e., store order defines a total
order on all the stores to each memory location, based on the order in which they reach memory.

Source: \(e \to^{\mathrm{src\text{-}ss}}_{\tau} e'\) iff e' is a load which reads its value from e,
via memory or by buffer forwarding.

Conflict: \(e \to^{\mathrm{cf\text{-}ss}}_{\tau} e'\) iff e' is the store which overwrites the
value read by e.

We are now ready to state the equivalence theorem. 

Theorem 1. (Equivalence of Shasha-Snir traces and chronological traces) For a
given program \(\mathcal{P}\) with two completed executions \(\tau\), \(\tau'\), it holds that
\(T(\tau) = T(\tau')\) iff \(T_C(\tau) = T_C(\tau')\).

The proof is found in Appendix A. 

DPOR for TSO A DPOR algorithm can exploit chronological traces to perform stateless
model checking of programs that execute under TSO (and PSO), as illustrated at the
end of Section 2. The explored executions follow the semantics of TSO in Section 3.
For each execution, its happens-before relation, which is the transitive closure of the
edge relation
\(E = \to^{\mathrm{po}}_{\tau} \cup \to^{\mathrm{su}}_{\tau} \cup \to^{\mathrm{uu}}_{\tau} \cup \to^{\mathrm{src\text{-}ct}}_{\tau} \cup \to^{\mathrm{cf\text{-}ct}}_{\tau} \cup \to^{\mathrm{uf}}_{\tau}\)
of the corresponding chronological trace, is computed on the fly. Such a computation is
described in more detail in Appendix B. This happens-before relation can in principle be
exploited by any DPOR algorithm to explore at least one execution per equivalence class
induced by Shasha-Snir traces. We state the following theorem of correctness.

Theorem 2. (Correctness of DPOR algorithms) The algorithms Source-DPOR and 
Optimal-DPOR of [1], based on the happens-before relation induced by chronological 
traces, explore at least one execution per equivalence class induced by Shasha-Snir 
traces. Moreover, Optimal-DPOR explores exactly one execution per equivalence class. 

The proof is found in Appendix B. 


4 Implementation 

To show the effectiveness of our techniques we have implemented a stateless model
checker for C programs. The tool, called Nidhugg, is available as open source at
https://github.com/nidhugg/nidhugg. Major design decisions have been that Nidhugg
(i) should not be bound to a specific hardware architecture and (ii) should use an existing,
mature implementation of C semantics, not implement its own. Our choice was
to use the LLVM compiler infrastructure [22] and work at the level of its intermediate
representation (IR). LLVM IR is low-level and allows us to analyze assembly-like
but target-independent code which is produced after employing all optimizations and
transformations that the LLVM compiler performs till this stage.

Nidhugg detects assertion violations and robustness violations that occur under the
selected memory model. We implement the Source-DPOR algorithm from Abdulla
et al. [1], adapted to relaxed memory in the manner described in this paper. Before
applying Source-DPOR, each spin loop is replaced by an equivalent single load and assume
statement. This substantially improves the performance of Source-DPOR, since
a waiting spin loop may generate a huge number of unproductive loads, all returning
the same wrong value; all of these loads will cause races, which will cause the number
of explored traces to explode. Exploration of program executions is performed by
interpretation of LLVM IR, based on the interpreter lli which is distributed with LLVM.
We support concurrency through the pthreads library. This is done by hooking calls to
pthread functions, and executing changes to the execution stacks (adding new threads,
joining, etc.) as appropriate within the interpreter.


5 Experimental Results 


We have applied our implementation to several intensely racy benchmarks, all implemented
in C/pthreads. They include classical benchmarks, such as Dekker’s, Lamport’s
(fast) and Peterson’s mutual exclusion algorithms. Other programs, such as indexer.c,
are designed to showcase races that are hard to identify statically. Yet others
(stack_safe.c) use pthread mutexes to entirely avoid races. Lamport’s algorithm
and stack_safe.c originate from the TACAS Competition on Software Verification
(SV-COMP). Some benchmarks originate from industrial code: apr_1.c, apr_2.c, pgsql.c and
parker.c.

We show the results of our tool Nidhugg in Table 1. For comparison we also
include the results of two other analysis tools, CBMC [5] and goto-instrument [4], which
also target C programs under relaxed memory. The techniques of goto-instrument and
CBMC are described in more detail in Section 6.

All experiments were run on a machine equipped with a 3 GHz Intel i7 processor 
and 6 GB RAM running 64-bit Linux. We use version 4.9 of goto-instrument and 
CBMC. The benchmarks have been tweaked to work for all tools, in communication 
with the developers of CBMC and goto-instrument. All benchmarks are available at 
https://github.com/nidhugg/benchmarks_tacas2015. 

Table 1 shows that our technique performs well compared to the other tools for most 
of the examples. We will briefly highlight a few interesting results. 

We see that in most cases Nidhugg pays a very modest performance price when
going from sequential consistency to TSO and PSO. The explanation is that the number
of executions explored by our stateless model checker is close to the number of
Shasha-Snir traces, which increases very modestly when going from sequential
consistency to TSO and PSO for typical benchmarks. Consider for example the benchmark
stack_safe.c, which is robust, and therefore has equally many Shasha-Snir traces (and
hence also chronological traces) under all three memory models. Our technique is able
to benefit from this, and has almost the same run time under TSO and PSO as under SC.

Table 1: Analysis times (in seconds) for our implementation Nidhugg, as well as CBMC
and goto-instrument under the SC, TSO and PSO memory models. Stars (*) indicate
that the analysis discovered an error in the benchmark. A t/o entry means that the tool
did not terminate within 10 minutes. An ! entry means that the tool crashed. Struck out
entries mean that the tool gave the wrong result. In the fence column, a dash (-) means
that no fences have been added to the benchmark, a memory model indicates that fences
have been (manually) added to make the benchmark correct under that and stronger
memory models. The LB column shows the loop unrolling depth. Superior run times
are shown in bold face.




We also see that our implementation compares favourably against CBMC, a
state-of-the-art bounded model checking tool, and goto-instrument. For several benchmarks,
our implementation is several orders of magnitude faster.

The effect of the optimization to replace each spin loop by a load and assume
statement can be seen in the pgsql.c benchmark. For comparison, we also include the
benchmark pgsql_bnd.c, where the spin loop has been modified such that Nidhugg fails to
automatically replace it by an assume statement.

The only other benchmark where Nidhugg is not faster is fib_true.c. The benchmark 
has two threads that perform the actual work, and one separate thread that checks the 
correctness of the computed value, causing many races, as in the case of spin loops. We 
show with the benchmark fib_true_join.c that in this case, the problem can be alleviated 
by forcing the threads to join before checking the result. 

Most benchmarks in Table 1 are small program cores, ranging from 36 to 118 lines
of C code, exhibiting complicated synchronization patterns. To show that our technique
is also applicable to real life code, we include the benchmarks apr_1.c and apr_2.c.
They each contain approximately 8000 lines of code taken from the Apache Portable
Runtime library, and exercise the library primitives for thread management, locking, and
memory pools. Nidhugg is able to analyze the code within a few seconds. We notice
that despite the benchmarks being robust, the analysis under PSO suffers a slowdown
of about three times compared to TSO. This is because the benchmarks access a large
number of different memory locations. Since PSO semantics require one store buffer
per memory location, this affects analysis under PSO more than under SC and TSO.
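
The per-location buffering mentioned above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the class PsoBuffers and its method names are hypothetical, memory is assumed to start at 0, and nothing here is meant to reflect how Nidhugg represents buffers internally.

```python
from collections import deque
from typing import Deque, Dict, List


class PsoBuffers:
    """Store buffers of one thread under PSO: one FIFO queue per memory location."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[int]] = {}

    def store(self, x: str, v: int) -> None:
        # A store is appended to the queue of its own location only.
        self.queues.setdefault(x, deque()).append(v)

    def load(self, x: str, memory: Dict[str, int]) -> int:
        # Buffer forwarding: the newest pending store to x wins over memory.
        q = self.queues.get(x)
        return q[-1] if q else memory.get(x, 0)

    def pending(self) -> List[str]:
        # Locations that still have stores waiting to reach memory.
        return sorted(x for x, q in self.queues.items() if q)

    def update(self, x: str, memory: Dict[str, int]) -> None:
        # The oldest pending store to x reaches memory.
        memory[x] = self.queues[x].popleft()


memory: Dict[str, int] = {}
buffers = PsoBuffers()
buffers.store("x", 1)
buffers.store("y", 1)
buffers.update("y", memory)  # the later store (to y) reaches memory first
print(memory, buffers.pending())
```

In this sketch every location a thread writes to adds one more queue that can be flushed independently, which is one way to picture why benchmarks touching many locations are affected more under PSO than under TSO.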


6 Related Work 

To the best of our knowledge, our work is the first to apply stateless model checking
techniques to the setting of relaxed memory models; see e.g. [1] for a recent survey
of related work on stateless model checking and dynamic partial order reduction
techniques. There have been many works dedicated to the verification and checking of
programs running under relaxed memory models (RMM) (e.g., [17,19,21,2,9,10,6,7,8,35]).
Some of them propose precise analyses for checking safety properties or robustness of
finite-state programs under TSO (e.g., [2,7]). Others describe monitoring and testing
techniques for programs under RMM (e.g., [9,10,21]). There are also a number of efforts
to design bounded model checking techniques for programs under RMM (e.g., [35,8])
which encode the verification problem in SAT.

The two closest works to ours are those presented in [5,4]. The first of them [5]
develops a bounded model checking technique that can be applied to different memory
models (e.g., TSO, PSO, and Power). That technique makes use of the fact that the
trace of a program under RMM can be viewed as a partially ordered set. This results
in a bounded model checking technique aware of the underlying memory model when
constructing the SMT/SAT formula. The second line of work reduces the verification
problem of a program under RMM to verification under SC of a program constructed
by a code transformation [4]. This technique tries to encode the effect of the RMM
semantics by augmenting the input program with buffers and queues. This work also
introduces the notion of Xtop objects. Although an Xtop object is a valid acyclic
representation of Shasha-Snir traces, it will in many cases distinguish executions that are
semantically equivalent according to the Shasha-Snir traces. This is never the case for
chronological traces. More details on the comparison with Xtop objects can be found in
Appendix D. An extensive experimental comparison with the corresponding tools [5,4]
for programs running under the TSO and PSO memory models was given in Section 5.


7 Concluding Remarks 

We have presented the first technique for efficient stateless model checking which is
aware of the underlying relaxed memory model. To this end we have introduced
chronological traces, which are novel representations of executions under the TSO and PSO
memory models, and induce a happens-before relation that is a partial order and can
be used as a basis for DPOR. Furthermore, we have established a strict one-to-one
correspondence between chronological and Shasha-Snir traces. Nidhugg, our publicly
available tool, detects bugs in LLVM assembly code produced for C/pthreads programs
and can be instantiated to the SC, TSO, and PSO memory models. We have applied
Nidhugg to several programs, both small benchmarks and programs of considerable size,
and our experimental results show that our technique offers significantly better
performance than both CBMC and goto-instrument in many cases.

We plan to extend Nidhugg to more memory models such as Power, ARM, and the
C/C++ memory model. This will require adapting the definition of chronological traces
to them in order to also guarantee the one-to-one correspondence with Shasha-Snir traces.

Fig. 9: Illustration of the definitions of upd_st() and upd_ld().


A Executions and Traces 

In this appendix we introduce our representation of program executions and define 
chronological traces. We also formalize Shasha-Snir traces for TSO and prove that 
there is a one-to-one correspondence between our chronological traces and Shasha-Snir 
traces. For completeness, we first recall, in slightly more detail, our model of TSO and 
our definition of executions. 


A.1 Concurrent Programs

We consider parallel programs running on shared memory under TSO. We assume that 
a number of threads run in parallel, each executing a deterministic code. We assume an 
assembly-like programming language for the code. The language includes instructions 
store: x:=$r, load: $r:=x, and fence. Other instructions do not access memory, and 
their precise syntax and semantics are ignored in this text for brevity. Here, and in the 
remainder of this text, x, y, z are used to name memory locations, u, v, w are used to 
name values, and $r, $s, $t are used to name registers local to processor cores. Where
convenient, we will use the short forms st(x) and ld(x) to denote some store and load 
of x respectively, where the value is not interesting. We will use TID to denote the set 
of all thread identifiers, and MemLoc to denote the set of all memory locations. 

A.2 TSO Semantics 

We now define the semantics of our programming language when running under TSO. 
For a function f, we use the notation f[x ← v] to denote the function f' such that
f'(x) = v and f'(y) = f(y) whenever y ≠ x. We use w · w' to denote the concatenation
of the words w and w'.



Let the program P be given. We define a system configuration as a pair (L, M),
where L(p) defines the thread local configuration for each thread p, and M(x) is the
current value at x in memory. The local configuration of each thread is defined by
L(p) = (R, B). Here R is some structure which contains local information such as
current register valuation (denoted R($r)) and current program counter (denoted R(pc)).
B represents the contents of the store buffer of the thread p. It is a word over pairs (x, v)
of memory locations and values. We let the notation B(x) denote the value v such that
(x, v) is the rightmost letter in B which is of the form (x, _). If there is no such letter in
B, then we say that B(x) = ⊥.
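
The lookup B(x) can be written down directly. The following Python fragment is a minimal sketch, assuming that a store buffer is modelled as a list of (location, value) pairs with the oldest entry first and that None stands in for ⊥; the name buffer_lookup is introduced here for illustration only.

```python
from typing import List, Optional, Tuple

Buffer = List[Tuple[str, int]]  # oldest entry first (leftmost), newest entry last


def buffer_lookup(buffer: Buffer, x: str) -> Optional[int]:
    """Return B(x), or None (standing in for ⊥) if the buffer holds no entry for x."""
    for loc, value in reversed(buffer):  # scan from the rightmost letter
        if loc == x:
            return value
    return None


B: Buffer = [("x", 1), ("y", 2), ("x", 3)]
print(buffer_lookup(B, "x"), buffer_lookup(B, "z"))
```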

In order to accommodate memory updates in our operational semantics we will
introduce the notion of auxiliary threads. For each thread p ∈ TID, we assume that there
is an auxiliary thread upd(p). The auxiliary thread upd(p) will nondeterministically
perform memory updates from the store buffer of p, when the buffer is non-empty. We
use AuxTID = {upd(p) | p ∈ TID} to denote the set of auxiliary thread identifiers. We
will use p and q to refer to real or auxiliary threads in TID ∪ AuxTID as convenient.

For configurations c = (L, M) and c' = (L', M'), we use the notation c →^{p} c' to
denote that from configuration c, thread p can execute its next instruction, and doing so
will change the system configuration into c'. We define the transition relation c →^{p} c'
depending on what the next instruction op of p is in c. In the following we assume
c = (L, M) and c' = (L', M') and L(p) = (R, B). We let R_pc = R[pc ← pc'] where
pc' is the next program counter of p after executing op.

Store: If op has the form store: x:=$r, then c →^{p} c' iff M' = M and L' = L[p ←
(R_pc, B · (x, v))] where v = R($r). Intuitively, under TSO, instead of updating the
memory with the new value v, we insert the entry (x, v) at the end of the store buffer of
the thread.

Load: If op has the form load: $r:=x, then c →^{p} c' iff either

1. (From memory) B(x) = ⊥ and M' = M and L' = L[p ← (R_pc[$r ← M(x)], B)].
Intuitively, there is no entry for x in the thread’s own store buffer, so the value is
read from memory.

2. (Buffer forwarding) B(x) ≠ ⊥ and M' = M and L' = L[p ← (R_pc[$r ← B(x)], B)].
Intuitively, we read the value of x from its latest entry in the store buffer
of the thread.

Fence: If op has the form fence, then c →^{p} c' iff B = ε and M' = M and L' = L[p ←
(R_pc, B)]. A fence can only be executed when the store buffer of the thread is empty.

Update: In addition to instructions which are executed by the threads, at any point
when a store buffer is non-empty, an update event may nondeterministically occur. The
memory is then updated according to the oldest (leftmost) letter in the store buffer,
and that letter is removed from the buffer. To formalize this, we will assume that the
auxiliary thread upd(p) executes a pseudo-instruction u(x). We then say that
c →^{upd(p)} c' iff B = (x, v) · B' for some x, v, B' and M' = M[x ← v] and
L' = L[p ← (R, B')].
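
The four transition rules can be exercised with a small interpreter. The Python sketch below is an illustration under stated assumptions, not the semantics as implemented in any tool: instructions are encoded as tuples, the "set" instruction is a hypothetical stand-in for the non-memory instructions whose syntax the text leaves open, memory is assumed to be initialised to 0, and the names ThreadState, step, update and dekker_idiom are introduced here only for the example.

```python
from typing import Dict, List, Tuple


class ThreadState:
    """Local configuration L(p) = (R, B) of one thread."""

    def __init__(self, program: List[Tuple]) -> None:
        self.program = program                    # list of instructions
        self.pc = 0                               # R(pc)
        self.regs: Dict[str, int] = {}            # R($r)
        self.buffer: List[Tuple[str, int]] = []   # B, oldest entry first


def step(threads: Dict[str, ThreadState], memory: Dict[str, int], p: str) -> bool:
    """Try to execute the next instruction of thread p; False if it is not enabled."""
    t = threads[p]
    if t.pc >= len(t.program):
        return False
    op = t.program[t.pc]
    if op[0] == "set":                 # hypothetical non-memory instruction: $r := constant
        t.regs[op[1]] = op[2]
    elif op[0] == "store":             # x := $r is appended to the buffer, memory is unchanged
        t.buffer.append((op[1], t.regs[op[2]]))
    elif op[0] == "load":              # $r := x, from the buffer if possible, else from memory
        pending = [v for (loc, v) in t.buffer if loc == op[2]]
        t.regs[op[1]] = pending[-1] if pending else memory.get(op[2], 0)
    elif op[0] == "fence" and t.buffer:
        return False                   # a fence is blocked while the buffer is non-empty
    t.pc += 1
    return True


def update(threads: Dict[str, ThreadState], memory: Dict[str, int], p: str) -> bool:
    """Transition of upd(p): move the oldest buffered store of p to memory."""
    t = threads[p]
    if not t.buffer:
        return False
    x, v = t.buffer.pop(0)
    memory[x] = v
    return True


def dekker_idiom(with_fence: bool) -> Dict[str, ThreadState]:
    fence = [("fence",)] if with_fence else []
    return {
        "p": ThreadState([("set", "$a", 1), ("store", "x", "$a")] + fence + [("load", "$r", "y")]),
        "q": ThreadState([("set", "$b", 1), ("store", "y", "$b")] + fence + [("load", "$s", "x")]),
    }


threads, memory = dekker_idiom(with_fence=False), {}
for p in ["p", "p", "p", "q", "q", "q"]:   # no update is scheduled before the loads
    assert step(threads, memory, p)
print(threads["p"].regs["$r"], threads["q"].regs["$s"])
```

In the schedule shown, both stores are still buffered when the loads execute, so both loads read the initial value. With with_fence=True the same schedule is rejected, because step returns False for the fence until update has emptied the buffer.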



A.3 Program Executions 


Based on the operational semantics in Section A.2, a program execution can be
defined as a sequence c_0 →^{p_0} c_1 →^{p_1} ··· →^{p_{n-1}} c_n of configurations related by
transitions labelled by actual or auxiliary thread IDs. Since each transition of each program
thread (including the auxiliary threads of form upd(q)) is deterministic, a program run
is uniquely determined by its sequence of thread IDs. Formally, we will therefore define
each execution as a word of events. Each event is a triple (p, i, j) which represents one
transition in the run. Here the thread p ∈ TID ∪ AuxTID can be either a regular thread
p ∈ TID executing an instruction i, or an auxiliary thread p ∈ AuxTID performing an
update to memory from a store buffer. In the latter case, i = u(x) is the instruction that
denotes an update to a memory location. The natural number j is used to disambiguate
events. We let j equal one plus the number of preceding triples (p', i', j') in the
execution with p' = p. For an event e = (p, i, j), we define tid(e) = p. We will use Event to
denote the set of all possible events. Figure 9 shows three sample executions.
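
The numbering of events can be made concrete with a few lines of Python. This is a minimal sketch, assuming that a run is given as a list of (thread, instruction) pairs with instructions written as plain strings; the function name to_events is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def to_events(schedule: List[Tuple[str, str]]) -> List[Tuple[str, str, int]]:
    """Turn a list of (thread, instruction) pairs into events (p, i, j)."""
    seen: Dict[str, int] = defaultdict(int)  # events emitted so far, per thread
    events = []
    for p, i in schedule:
        seen[p] += 1                          # j = 1 + number of earlier events of p
        events.append((p, i, seen[p]))
    return events


schedule = [("p", "st(x)"), ("q", "st(y)"), ("upd(p)", "u(x)"), ("p", "ld(y)")]
for event in to_events(schedule):
    print(event)
```

Note that the auxiliary thread upd(p) is numbered independently of p, so the first update of p's buffer gets index 1 regardless of how many instructions p itself has executed.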

For an execution τ and two events e, e' in τ, we say that e <_τ e' iff e strictly
precedes e' in τ. We define two dummy events e^0 = (⊥, ⊥, −1) and e^∞ = (⊥, ⊥, ∞),
and we extend <_τ such that for every event e ∉ {e^0, e^∞} we have e^0 <_τ e <_τ e^∞.

For an execution τ and an event e = (p, st(x), j) in τ, we define upd_st(e) to be the
update event in τ corresponding to the store event e. Formally, let k be the number of
events e_w = (p', st(y), j') for any memory location y in τ such that p' = p and j' ≤ j.
Then upd_st(e) = (upd(p), u(x), k) if there is such an event in τ. Otherwise upd_st(e) =
e^∞, denoting that the update is still pending at the end of τ. Figure 9(a) illustrates
the typical case, where the store e_s is eventually followed by its corresponding update
upd_st(e_s) = e_u. Figure 9(b) shows the case when the update corresponding to the store
e_s is still pending at the end of the execution, and therefore upd_st(e_s) = e^∞.
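
As an illustration, upd_st can be computed by counting stores. The sketch below assumes an encoding chosen for this example only: events are tuples (thread, (kind, location), index) with kind one of "st", "ld", "u", auxiliary threads are named "upd(p)", and E_INF stands in for the dummy event e^∞.

```python
E_INF = ("⊥", "⊥", float("inf"))  # stands in for the dummy event e^∞


def upd_st(tau, e):
    """Return the update event matching the store event e, or E_INF if it is pending."""
    p, (kind, x), j = e
    assert kind == "st", "upd_st is only defined for store events"
    # k = number of stores of thread p up to and including e; buffers are FIFO,
    # so the k-th store of p is flushed by the k-th event of upd(p).
    k = sum(1 for (q, i, n) in tau if q == p and i[0] == "st" and n <= j)
    update = (f"upd({p})", ("u", x), k)
    return update if update in tau else E_INF


tau = [
    ("p", ("st", "x"), 1),
    ("p", ("st", "y"), 2),
    ("upd(p)", ("u", "x"), 1),
]
print(upd_st(tau, tau[0]))  # the store to x has reached memory
print(upd_st(tau, tau[1]))  # the store to y is still buffered at the end of tau
```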

For an execution τ and an event e = (p, ld(x), j) in τ, we define upd_ld(e) to be the
update event of the latest store to x, which precedes e in the same thread. The intuition
is that upd_ld(e) is the update from which e would get its value in the case of buffer
forwarding. Formally, if there is an event e_w = (p, st(x), k) in τ such that k < j and
there is no event (p, st(x), l) in τ with k < l < j, then upd_ld(e) = upd_st(e_w). Otherwise
upd_ld(e) = e^0. Figures 9(a) and 9(b) show the typical case, where upd_ld(e_l) is taken to
be the update corresponding to the latest preceding store by the same thread to the same
memory location. Figure 9(c) shows the case when there is no such preceding store, and
upd_ld(e_l) is taken to be the dummy event e^0. (Notice that the store e_s is to a different
memory location.)
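
The definition of upd_ld can be sketched in the same style. As before, the encoding is an assumption made for this example: events are tuples (thread, (kind, location), index), and E_ZERO and E_INF stand in for the dummy events e^0 and e^∞. The helper upd_st is repeated so that the fragment is self-contained.

```python
E_ZERO = ("⊥", "⊥", -1)            # stands in for the dummy event e^0
E_INF = ("⊥", "⊥", float("inf"))   # stands in for the dummy event e^∞


def upd_st(tau, e):
    """Update event matching the store e, or E_INF if the store is still buffered."""
    p, (_, x), j = e
    k = sum(1 for (q, i, n) in tau if q == p and i[0] == "st" and n <= j)
    update = (f"upd({p})", ("u", x), k)
    return update if update in tau else E_INF


def upd_ld(tau, e):
    """Update of the latest store to x that precedes the load e in its own thread."""
    p, (kind, x), j = e
    assert kind == "ld", "upd_ld is only defined for load events"
    own = [w for w in tau if w[0] == p and w[1] == ("st", x) and w[2] < j]
    if not own:
        return E_ZERO                                   # no earlier store to x by p
    return upd_st(tau, max(own, key=lambda w: w[2]))    # latest such store


tau = [
    ("p", ("st", "x"), 1),
    ("p", ("st", "y"), 2),
    ("p", ("ld", "x"), 3),
    ("upd(p)", ("u", "x"), 1),
    ("q", ("ld", "x"), 1),
    ("p", ("ld", "y"), 4),
]
print(upd_ld(tau, tau[2]))  # own store to x, already matched by an update in tau
print(upd_ld(tau, tau[4]))  # q never stored to x
print(upd_ld(tau, tau[5]))  # own store to y is still pending
```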

A.4 Chronological Traces 

We can now introduce the main conceptual contribution of the paper, viz. chronological
traces. For an execution τ we define its chronological trace T_C(τ) as a directed graph
(V, E). The vertices V are all the events in τ; both events representing instructions and
events representing updates. The edges are the union of six relations:

E = →^{po}_τ ∪ →^{su}_τ ∪ →^{uu}_τ ∪ →^{src-ct}_τ ∪ →^{cf-ct}_τ ∪ →^{uf}_τ.

We will illustrate the definition on an execution of the program in Figure 10(a),
which contains an idiom that occurs in the mutual exclusion algorithm of Peterson. It
is mostly the same as that from Dekker’s mutual exclusion algorithm. But it has two
additional accesses in each thread to a separate memory location z. These provide an
opportunity to display buffer forwarding. Figure 10(c) shows an example of an execution
and Figure 10(b) shows its corresponding chronological trace.

We define the edge relations of chronological traces as follows, for two arbitrary
events e = (p, i, j), e' = (p', i', j') ∈ V:

Program Order: e →^{po}_τ e' iff p = p' and j' = j + 1. For example, in Figure 10(b) there
is a program order edge from the store instruction (p, st(x), 1) to the store instruction
(p, st(z), 2) which immediately follows it in the program of thread p. Similarly, the
updates of each thread are program ordered. E.g., (upd(p), u(x), 1) →^{po}_τ (upd(p), u(z), 2).

Store to Update: e →^{su}_τ e' iff i = st(x) for some x and upd_st(e) = e'. I.e., e' is the
update corresponding to the store e. This is illustrated in Figure 10(b) where there is an
su-edge from each store to its corresponding update.

Update to Update: e →^{uu}_τ e' iff i = u(x) and i' = u(x) for some x and e <_τ e' and there
is no event e'' = (p'', u(x), j'') such that e <_τ e'' <_τ e'. I.e., →^{uu}_τ defines the total,
chronological order on updates to each memory location. In Figure 10(b) we see that
the two updates to z are uu-ordered with each other in the same order as they appear in
the execution in Figure 10(c). However, they are not uu-ordered with the updates to x
and y.

Source: e →^{src-ct}_τ e' iff for some x it holds that i = u(x) and i' = ld(x) and
upd_ld(e') <_τ e <_τ e' and there is no update e'' = (p'', u(x), j'') to x such that
e <_τ e'' <_τ e'. I.e., if the source of the value read by e' is an update e from a different
process, then e →^{src-ct}_τ e'. Otherwise, there is no incoming →^{src-ct}_τ edge to e'. Notice
that this definition excludes the possibility of p = upd(p'); a load is never src-related to
an update from the same thread. In Figure 10(b) we see that the load (q, ld(x), 4) takes
its value from the update (upd(p), u(x), 1). Therefore the events are src-related. But the
loads to z both read the value written by their own thread, and therefore have no
src-relation.

Conflict: e →^{cf-ct}_τ e' iff i = ld(x) and i' = u(x) for some x and e' is the first (w.r.t.
<_τ) event e_u of the form (_, u(x), _) such that both e <_τ e_u and upd_ld(e) <_τ e_u.
The intuition here is that e →^{cf-ct}_τ e' when e' is the first update which succeeds e in the
coherence order of x. Equivalently, e' is the update that overwrites the value that was
read by e. In Figure 10(b), the load to y by p reads the initial value of y, which is then
overwritten by the update to y by q. Therefore the load has a cf-edge to the update. The
load to z by p reads the value of (upd(p), u(z), 2) by buffer forwarding. That value is
later overwritten in memory by the update (upd(q), u(z), 2). Therefore the load has a
cf-edge to the update originating in thread q.

Update to Fence: e →^{uf}_τ e' iff i = u(x) for some x, and i' = fence and p = upd(p') and
e <_τ e' and there is no event e'' = (p, u(y), j'') for any y such that e <_τ e'' <_τ e'.
The intuition here is that the fence cannot be executed until all pending updates of the
same thread have been flushed from the buffer. Hence the updates are ordered before
the fence.
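
The six edge relations can be computed directly from their definitions. The following Python sketch does so by brute force; it is an illustration under stated assumptions rather than the way a DPOR implementation would maintain the trace. Events are encoded as tuples (thread, (kind, location), index) with kind one of "st", "ld", "u", "fence" (a fence uses None as its location), auxiliary threads are named "upd(p)", and the sample execution is a hypothetical one for the plain Dekker idiom, not the execution of Figure 10(c).

```python
E_ZERO = ("⊥", "⊥", -1)            # dummy event e^0
E_INF = ("⊥", "⊥", float("inf"))   # dummy event e^∞


def upd_st(tau, e):
    p, (_, x), j = e
    k = sum(1 for (q, i, n) in tau if q == p and i[0] == "st" and n <= j)
    update = (f"upd({p})", ("u", x), k)
    return update if update in tau else E_INF


def upd_ld(tau, e):
    p, (_, x), j = e
    own = [w for w in tau if w[0] == p and w[1] == ("st", x) and w[2] < j]
    return upd_st(tau, max(own, key=lambda w: w[2])) if own else E_ZERO


def chronological_trace(tau):
    """Return the edges of T_C(tau) as triples (source, label, target)."""
    pos = {e: n for n, e in enumerate(tau)}      # position of each event, i.e. <_tau
    pos[E_ZERO], pos[E_INF] = -1, len(tau)
    edges = set()
    for e in tau:
        p, (kind, x), j = e
        for f in tau:                            # program order: same thread, next index
            if f[0] == p and f[2] == j + 1:
                edges.add((e, "po", f))
        if kind == "st" and upd_st(tau, e) != E_INF:
            edges.add((e, "su", upd_st(tau, e)))
        # Updates to x before and after e, in execution order.
        earlier = [u for u in tau if u[1] == ("u", x) and pos[u] < pos[e]]
        later = [u for u in tau if u[1] == ("u", x) and pos[u] > pos[e]]
        if kind == "u" and later:                # next update to the same location
            edges.add((e, "uu", later[0]))
        if kind == "ld":
            own = upd_ld(tau, e)
            # Source: the last update to x before e, if it is newer than e's own store.
            if earlier and pos[own] < pos[earlier[-1]]:
                edges.add((earlier[-1], "src-ct", e))
            # Conflict: the first update to x after both e and upd_ld(e).
            overwriting = [u for u in later if pos[own] < pos[u]]
            if overwriting:
                edges.add((e, "cf-ct", overwriting[0]))
        if kind == "fence":                      # last update of upd(p) before the fence
            flushed = [u for u in tau if u[0] == f"upd({p})" and pos[u] < pos[e]]
            if flushed:
                edges.add((flushed[-1], "uf", e))
    return edges


tau = [
    ("p", ("st", "x"), 1), ("p", ("ld", "y"), 2), ("upd(p)", ("u", "x"), 1),
    ("q", ("st", "y"), 1), ("upd(q)", ("u", "y"), 1), ("q", ("ld", "x"), 2),
]
for src, label, dst in sorted(chronological_trace(tau), key=lambda edge: edge[1]):
    print(src, f"-{label}->", dst)
```

In this sample execution the load of y by p precedes the update of y, which gives a cf-ct edge, while the load of x by q follows the update of x by upd(p), which gives a src-ct edge.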


A.5 Shasha-Snir Traces 


We will now prove that chronological traces are equivalent to Shasha-Snir traces, in the
sense that there is a one-to-one mapping between Shasha-Snir traces T and chronological
traces T_C such that the set of executions corresponding to T is the same as the set
of executions corresponding to T_C.

We briefly recall the definition of Shasha-Snir traces, based on the definition by
Bouajjani et al. [7].

First, we introduce the notion of a completed execution. We say that an execution
τ is completed when all stores have reached memory, i.e., when for every event e =
(p, st(x), j) in τ we have upd_st(e) ≠ e^∞. In the context of Shasha-Snir traces, we will
restrict ourselves to completed executions.

For a completed execution τ, we define the Shasha-Snir trace of τ as the graph
T(τ) = (V, E) where V is the set of all non-update events (p, i, j) in τ where i ≠ u(x)
for all x. The edges E are the union of four relations:

E = →^{po}_τ ∪ →^{st}_τ ∪ →^{src-ss}_τ ∪ →^{cf-ss}_τ.

For two arbitrary events e = (p, i, j), e' = (p', i', j') ∈ V, we define the relations
as follows:

Program Order: e →^{po}_τ e' iff p = p' and j' = j + 1. This is the same program order as
for chronological traces.

Store Order: e →^{st}_τ e' iff i = st(x) and i' = st(x) and the corresponding updates
are ordered in τ s.t. upd_st(e) <_τ upd_st(e') and there is no other update event e'' =
(p'', u(x), j'') such that upd_st(e) <_τ e'' <_τ upd_st(e'). I.e., store order defines a total
order on all the stores to each memory location, based on the order in which they reach
memory.

Source: e →^{src-ss}_τ e' iff i' = ld(x) and e is the maximal store event e'' = (p'', st(x), j'')
with respect to →^{st*}_τ such that either upd_st(e'') <_τ e' or e'' →^{po*}_τ e'. I.e., e →^{src-ss}_τ e'
when e' is a load which reads its value from e, via memory or by buffer forwarding.

Conflict: e →^{cf-ss}_τ e' iff i = ld(x) and i' = st(x) and if there is an event e'' such that
e'' →^{src-ss}_τ e then e'' →^{st}_τ e', otherwise e' has no predecessor in →^{st}_τ. I.e., e' is the store
which overwrites the value that was read by e.
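
For comparison with chronological traces, the four Shasha-Snir relations can also be computed by brute force. The Python sketch below assumes the same hypothetical encoding of events as tuples (thread, (kind, location), index), requires a completed execution, and uses an illustrative execution of the plain Dekker idiom; the function names are introduced here for the example only.

```python
def upd_st(tau, e):
    """Update event matching the store e; the execution is assumed to be completed."""
    p, (_, x), j = e
    k = sum(1 for (q, i, n) in tau if q == p and i[0] == "st" and n <= j)
    update = (f"upd({p})", ("u", x), k)
    assert update in tau, "Shasha-Snir traces are only defined for completed executions"
    return update


def shasha_snir_trace(tau):
    """Return (V, E) of T(tau), with edges as triples (source, label, target)."""
    pos = {e: n for n, e in enumerate(tau)}
    events = [e for e in tau if e[1][0] != "u"]          # V: all non-update events
    edges = set()

    def stores_to(x):
        # Stores to x in store order, i.e. in the order their updates reach memory.
        return sorted((s for s in events if s[1] == ("st", x)),
                      key=lambda s: pos[upd_st(tau, s)])

    for e in events:
        p, (kind, x), j = e
        edges.update((e, "po", f) for f in events if f[0] == p and f[2] == j + 1)
        if kind == "st":
            order = stores_to(x)
            successor = order.index(e) + 1
            if successor < len(order):
                edges.add((e, "st", order[successor]))
        elif kind == "ld":
            order = stores_to(x)
            # Stores whose update precedes e, or which precede e in program order.
            visible = [s for s in order
                       if pos[upd_st(tau, s)] < pos[e] or (s[0] == p and s[2] < j)]
            if visible:
                edges.add((visible[-1], "src-ss", e))
            # The conflicting store is the store-order successor of the source,
            # or the first store to x if the load has no source.
            successor = order.index(visible[-1]) + 1 if visible else 0
            if successor < len(order):
                edges.add((e, "cf-ss", order[successor]))
    return events, edges


tau = [
    ("p", ("st", "x"), 1), ("p", ("ld", "y"), 2), ("upd(p)", ("u", "x"), 1),
    ("q", ("st", "y"), 1), ("upd(q)", ("u", "y"), 1), ("q", ("ld", "x"), 2),
]
vertices, edges = shasha_snir_trace(tau)
print(len(vertices), "vertices,", len(edges), "edges")
```

The update events do not appear as vertices here; their ordering is reflected only through the store order used to pick the source and the conflicting store of each load.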

The definition of Shasha-Snir traces is illustrated in Figure 10(d). We are now ready 
to state the equivalence theorem. 

Theorem 1. (Equivalence of Shasha-Snir traces and chronological traces) For a
given program P with two completed executions τ, τ' it holds that T(τ) = T(τ') iff
T_C(τ) = T_C(τ').

Proof We decompose the theorem into the following two lemmas, which are proven
separately.

Lemma 1. (Equivalence of Shasha-Snir traces and chronological traces: ⇒
direction) For a given program P with two completed executions τ, τ', it holds that if
T(τ) = T(τ') then T_C(τ) = T_C(τ').

Lemma 2. (Equivalence of Shasha-Snir traces and chronological traces: ⇐
direction) For a given program P with two completed executions τ, τ', it holds that if
T_C(τ) = T_C(τ') then T(τ) = T(τ').


Proof of Lemma 1 Let two completed executions τ and τ' be given. Let

T(τ) = (V_SS, →^{po}_τ ∪ →^{st}_τ ∪ →^{src-ss}_τ ∪ →^{cf-ss}_τ) and

T(τ') = (V'_SS, →^{po}_{τ'} ∪ →^{st}_{τ'} ∪ →^{src-ss}_{τ'} ∪ →^{cf-ss}_{τ'}) and

T_C(τ) = (V_C, →^{po}_τ ∪ →^{su}_τ ∪ →^{uu}_τ ∪ →^{src-ct}_τ ∪ →^{cf-ct}_τ ∪ →^{uf}_τ) and

T_C(τ') = (V'_C, →^{po}_{τ'} ∪ →^{su}_{τ'} ∪ →^{uu}_{τ'} ∪ →^{src-ct}_{τ'} ∪ →^{cf-ct}_{τ'} ∪ →^{uf}_{τ'}).

Furthermore, assume that T(τ) = T(τ').

First, we determine that the events are the same in both chronological traces: V_C =
V'_C. From V_SS = V'_SS we have that the non-update events in τ are the same as the ones
in τ'. Since τ and τ' contain the same stores for each thread in the same per-thread
order, it follows from the completedness of τ and τ', and from the TSO semantics that
τ and τ' also have the same update events. Hence V_C = V'_C.

We see that the definitions of program order and store to update order in
chronological traces are entirely determined by which events exist in the execution for each
thread. Since both executions have the same events, we conclude that →^{po}_τ = →^{po}_{τ'} and
→^{su}_τ = →^{su}_{τ'}. The equality of update to fence order follows similarly.

Let us consider the definitions of update to update order for chronological traces and
store order for Shasha-Snir traces. We see that there is a one-to-one mapping between
relations e →^{st}_τ e' for stores in Shasha-Snir traces to relations upd_st(e) →^{uu}_τ upd_st(e') in
chronological traces. Since the store orders are the same for τ and τ', we thus conclude
that the update to update orders are also the same: →^{uu}_τ = →^{uu}_{τ'}.

We now turn our attention to proving that →^{src-ct}_τ = →^{src-ct}_{τ'}. We will prove that
→^{src-ct}_τ ⊆ →^{src-ct}_{τ'}. From symmetry it then follows that →^{src-ct}_{τ'} ⊆ →^{src-ct}_τ, and hence
→^{src-ct}_τ = →^{src-ct}_{τ'}. Let us assume that the relation e →^{src-ct}_τ e' exists in →^{src-ct}_τ for some
events e = (p, u(x), j) and e' = (p', ld(x), j'). We will prove that the same relation
e →^{src-ct}_{τ'} e' exists in →^{src-ct}_{τ'}. From the definition of →^{src-ct}_τ we have that upd_ld(e') <_τ
e <_τ e' and there is no update e'' = (p'', u(x), j'') to the same memory location such
that e <_τ e'' <_τ e'. Since e' is preceded in τ by at least one update to x, there must be a
store event e_w such that e_w →^{src-ss}_τ e' in τ. From the definition of →^{src-ss}_τ we have that e_w
is the maximal event (p'', st(x), j'') with respect to →^{st*}_τ such that either upd_st(e_w) <_τ
e' or e_w →^{po*}_τ e'. If e_w →^{po*}_τ e' then upd_st(e_w) = upd_ld(e'). But then the maximality of
e_w contradicts upd_ld(e') <_τ e <_τ e'. Hence we have upd_st(e_w) <_τ e'. Maximality
of e_w now gives that upd_st(e_w) = e. Since →^{src-ss}_τ = →^{src-ss}_{τ'} we have that in τ' also
e_w →^{src-ss}_{τ'} e'. From the definition of →^{src-ss}_{τ'} and ¬(e_w →^{po*}_{τ'} e') we know that upd_st(e_w)
is the store-order-maximal update to x that precedes e' in τ'. Since the store order is the
same for τ and τ' we have upd_ld(e') <_{τ'} e. But then e = upd_st(e_w) satisfies the criteria
for e →^{src-ct}_{τ'} e'.

Finally, we will show that →^{cf-ct}_τ = →^{cf-ct}_{τ'}. Similarly to the proof for →^{src-ct}_τ =
→^{src-ct}_{τ'}, it suffices here to show that →^{cf-ct}_τ ⊆ →^{cf-ct}_{τ'}. Assume therefore that e_r →^{cf-ct}_τ e_u
for some events e_r = (p, ld(x), j), e_u = (p', u(x), j'). We will show that e_r →^{cf-ct}_{τ'} e_u.
The definition of →^{cf-ct}_τ gives that e_u is the first (w.r.t. <_τ) event e of the form (_, u(x), _)
such that both e_r <_τ e and upd_ld(e_r) <_τ e. Let e_w be the store event such that
upd_st(e_w) = e_u. We will split the proof in cases depending on whether or not there
exists a source event for e_r in the Shasha-Snir traces.

Assume therefore first (i) that there is no event e_src such that e_src →^{src-ss}_τ e_r. Then
there is no update to x that precedes e_r in <_τ. Furthermore upd_ld(e_r) = e^0. This tells
us that e_w has no predecessor in →^{st}_τ. Since →^{st}_τ = →^{st}_{τ'} we also have that e_w has no
predecessor in →^{st}_{τ'}. Furthermore, since e_r has no source event in τ', it must be the case
that e_r <_{τ'} e_u. But then, e_u is the first update event in τ' which is after both e_r and
upd_ld(e_r). And so we have e_r →^{cf-ct}_{τ'} e_u.

Next assume (ii) that there is an event e_src with e_src →^{src-ss}_τ e_r and that tid(e_src) =
tid(e_r). Then it must be the case that upd_st(e_src) = upd_ld(e_r). Since →^{src-ss}_τ = →^{src-ss}_{τ'},
we have that e_src →^{src-ss}_{τ'} e_r. There can be no update event e to the same memory location
x such that upd_ld(e_r) <_τ e <_τ e_r. If there were such an e, then e_src wouldn’t be the
source of e_r. The same argument goes in τ'. This tells us that e_u is the immediate store
order successor of upd_ld(e_r), i.e., upd_ld(e_r) →^{uu}_τ e_u and e_src →^{st}_τ e_w. Since →^{uu}_τ = →^{uu}_{τ'}
we have upd_ld(e_r) →^{uu}_{τ'} e_u. Hence e_u is the first update event which succeeds both e_r
and upd_ld(e_r) in <_{τ'}. Thus e_r →^{cf-ct}_{τ'} e_u.

Lastly, we assume (iii) that there is an event e_src such that e_src →^{src-ss}_τ e_r and that
tid(e_src) ≠ tid(e_r). Then it is the case in τ that upd_ld(e_r) <_τ upd_st(e_src) <_τ e_r.
And there is no update event e to x such that upd_st(e_src) <_τ e <_τ e_r. The same
holds in τ'. Since e_u is the first update to x after e_r in τ, this means that we have
upd_st(e_src) →^{uu}_τ e_u. We have →^{uu}_τ = →^{uu}_{τ'} so upd_st(e_src) →^{uu}_{τ'} e_u. Now it must be the
case that e_r <_{τ'} e_u. Otherwise, e_src wouldn’t be the source of e_r in τ', and we know
e_src →^{src-ss}_{τ'} e_r. Hence e_u is an update event that succeeds both e_r and upd_ld(e_r) in <_{τ'}.
It remains to show that it is the first such update. Suppose e ≠ e_u is an update event to
x such that e_r <_{τ'} e <_{τ'} e_u. Then it would be the case that upd_st(e_src) <_{τ'} e <_{τ'} e_u.
But this would contradict upd_st(e_src) →^{uu}_{τ'} e_u. Thus we have e_r →^{cf-ct}_{τ'} e_u.

This concludes the proof of T_C(τ) = T_C(τ').

Proof of Lemma 2 Let two completed executions τ and τ' be given. Let

T(τ) = (V_SS, →^{po}_τ ∪ →^{st}_τ ∪ →^{src-ss}_τ ∪ →^{cf-ss}_τ) and

T(τ') = (V'_SS, →^{po}_{τ'} ∪ →^{st}_{τ'} ∪ →^{src-ss}_{τ'} ∪ →^{cf-ss}_{τ'}) and

T_C(τ) = (V_C, →^{po}_τ ∪ →^{su}_τ ∪ →^{uu}_τ ∪ →^{src-ct}_τ ∪ →^{cf-ct}_τ ∪ →^{uf}_τ) and

T_C(τ') = (V'_C, →^{po}_{τ'} ∪ →^{su}_{τ'} ∪ →^{uu}_{τ'} ∪ →^{src-ct}_{τ'} ∪ →^{cf-ct}_{τ'} ∪ →^{uf}_{τ'}).

Furthermore, assume that T_C(τ) = T_C(τ').

We will prove that T(τ) = T(τ'). We know that V_SS (respectively V'_SS) is precisely
the non-updates of V_C (respectively V'_C). Since V_C = V'_C we have V_SS = V'_SS.

For the relations →^{po}_τ and →^{st}_τ a reasoning analogous to that in the ⇒ direction gives
that →^{po}_τ = →^{po}_{τ'} and →^{st}_τ = →^{st}_{τ'}.

We will show that →^{src-ss}_τ ⊆ →^{src-ss}_{τ'}. Symmetry then gives →^{src-ss}_{τ'} ⊆ →^{src-ss}_τ, and
hence →^{src-ss}_τ = →^{src-ss}_{τ'}. Assume therefore that e_w →^{src-ss}_τ e_r holds for some events
e_w = (p, st(x), j) and e_r = (p', ld(x), j'). Then by the definition of →^{src-ss}_τ we have
that e_w is the maximal event e = (p'', st(x), j'') with respect to →^{st*}_τ such that either
upd_st(e) <_τ e_r or e →^{po*}_τ e_r. We will separate the proof by cases: either tid(e_w) =
tid(e_r) or tid(e_w) ≠ tid(e_r).

Assume first (i) that tid(e_w) = tid(e_r). Then it holds that e_w →^{po*}_τ e_r, since the
events must be program ordered, and the other direction implies e_r <_τ upd_st(e_w).
Program order is the same in τ' as in τ, so we also have e_w →^{po*}_{τ'} e_r. It remains to show
that e_w is maximal in τ'. First we conclude that there can be no store event e such
that e_w →^{st}_{τ'} e and e →^{po*}_{τ'} e_r. This is because both the program order and the store order
are the same in τ' as in τ, and hence such an event e would contradict the assumed
maximality of e_w w.r.t. τ. As a corollary we have upd_ld(e_r) = upd_st(e_w). Next we
need to conclude that there is no event e such that e_w →^{st}_{τ'} e and upd_st(e) <_{τ'} e_r. We
know that there is no such event in τ: i.e., there is no event e such that e_w →^{st}_τ e and
upd_st(e) <_τ e_r. Hence by the definition of →^{src-ct}_τ there is no event e_src which is source
related with e_r in the chronological trace: e_src →^{src-ct}_τ e_r. Since →^{src-ct}_τ = →^{src-ct}_{τ'}, the
same holds in τ'. Now if there were an event such as e in τ', then e_r would have a source
according to →^{src-ct}_{τ'}. This is a contradiction, and so there can be no such e in τ'. Hence,
e_w is the maximal store event w.r.t. →^{st*}_{τ'} which is either updated <_{τ'}-before e_r or
program order-before e_r. That concludes the proof for the case that tid(e_w) = tid(e_r).

Next assume (ii) that tid(e_w) ≠ tid(e_r). Clearly e_w is not program ordered with
e_r. Hence the definition of →^{src-ss}_τ gives that upd_st(e_w) <_τ e_r. The maximality of e_w
gives that upd_ld(e_r) <_τ upd_st(e_w), and that there is no update event e = (p'', u(x), j'')
such that upd_st(e_w) <_τ e <_τ e_r. Then we have upd_st(e_w) →^{src-ct}_τ e_r by the definition
of →^{src-ct}_τ. By →^{src-ct}_τ = →^{src-ct}_{τ'} we also have upd_st(e_w) →^{src-ct}_{τ'} e_r. By the definition
of →^{src-ct}_{τ'} we now have that e_w is the greatest (w.r.t. <_{τ'}) store event with
upd_st(e_w) <_{τ'} e_r. We also have that upd_ld(e_r) <_{τ'} upd_st(e_w). Since there can be no
event e = (_, st(x), _) such that e →^{po*}_{τ'} e_r and upd_ld(e_r) <_{τ'} upd_st(e), we have that e_w
is the maximal event e = (_, st(x), _) with respect to →^{st*}_{τ'} such that either upd_st(e) <_{τ'}
e_r or e →^{po*}_{τ'} e_r. Hence e_w →^{src-ss}_{τ'} e_r. This concludes the proof for →^{src-ss}_τ = →^{src-ss}_{τ'}.

Since →^{cf-ss}_τ (respectively →^{cf-ss}_{τ'}) is entirely determined by →^{src-ss}_τ and →^{st}_τ
(respectively →^{src-ss}_{τ'} and →^{st}_{τ'}), and we know that →^{src-ss}_τ = →^{src-ss}_{τ'} and →^{st}_τ = →^{st}_{τ'}, we
immediately get that →^{cf-ss}_τ = →^{cf-ss}_{τ'}. This concludes the proof.

Proof of Theorem 1 The theorem follows directly from Lemmas 1 and 2. 

B DPOR for TSO 

In this appendix, we establish the correctness of Theorem 2, which states that the DPOR 
algorithms Source-DPOR and Optimal-DPOR of [1], when based on the happens-before 
relation induced by chronological traces, explore at least one execution per equivalence 
class induced by Shasha-Snir traces. Theorem 2 also states that Optimal-DPOR
explores exactly one execution per equivalence class. We also provide more detail on how
a DPOR algorithm, such as Source-DPOR of [1], can be used for SMC on programs 
running under TSO by computing chronological traces on the fly. 

Correctness of Source-DPOR also implies that several other DPOR algorithms, e.g.,
those in [13,27], would be correct if based on chronological traces. This is because these
algorithms are subsumed by Source-DPOR in the sense that the set of executions that are 
explored by these algorithms in some particular analysis includes the set of executions 
that could be explored by Source-DPOR in some analysis. 

Theorem 2. (Correctness of DPOR algorithms) The algorithms Source-DPOR and
Optimal-DPOR of [1], based on the happens-before relation induced by chronological
traces, explore at least one execution per equivalence class induced by Shasha-Snir
traces. Moreover, Optimal-DPOR explores exactly one execution per equivalence class.

Proof. The proof of Theorem 2 mainly uses the correctness of Source-DPOR, which is
proven in [1]. More precisely, in [1] it is proven that Source-DPOR is correct whenever
it is based on a valid assignment of happens-before relations to executions.
An assignment of happens-before relations $\rightarrow_\tau$ to executions $\tau$ is valid if it
satisfies the following natural properties (from [1]).

1. $\rightarrow_\tau$ is a partial order on the events in $\tau$, which is included in $<_\tau$,

2. the events of each thread are totally ordered by $\rightarrow_\tau$,

3. if $\tau'$ is a prefix of $\tau$, then $\rightarrow_\tau$ and $\rightarrow_{\tau'}$ are the same on $\tau'$,

4. the assignment of happens-before relations to executions partitions the set of
executions into equivalence classes; i.e., if $\tau'$ is a linearization of the happens-before
relation on $\tau$, then $\tau'$ is assigned the same happens-before relation as $\tau$; we use $\sim$
to denote the corresponding equivalence relation,

5. whenever $\tau$ and $\tau'$ are equivalent then they end up in the same global program state,

6. for any sequences $\tau$, $\tau'$ and $\tau''$, such that $\tau \cdot \tau''$ is an execution, we have
$\tau \sim \tau'$ if and only if $\tau \cdot \tau'' \sim \tau' \cdot \tau''$, and

7. if $\tau \cdot (p, i, j)$ is an execution, whose last event is performed by thread $p$, and
$q$, $r$ are different threads, such that $(p, i, j)$ would "happen before" a subsequent event
by $r$ but not a subsequent event by $q$, then $(p, i, j)$ would also "happen before"
$(r, i'', j'')$ in the execution $\tau \cdot (p, i, j) \cdot (q, i', j') \cdot (r, i'', j'')$.

A consequence of these definitions is that if $e$ and $e'$ are two consecutive events in
$\tau$ with $e \not\rightarrow_\tau e'$, then $e$ and $e'$ can be swapped without affecting the
(global) state after the two events.

The theorem can now be proven by establishing that the happens-before assignment
induced by chronological traces is valid. Conditions 1, 2, 3, and 6 follow
straightforwardly from the definitions. Condition 4 follows by observing that changing the
order between non-related events does not affect the definition of the chronological trace.
Condition 5 follows by observing that the chronological trace captures all dependences
that are needed for determining which values are read and written by loads and stores.
Finally, Condition 7 follows by noting that an arrow between $(p, i, j)$ and $(r, i'', j'')$ in
a chronological trace cannot be removed by inserting an event that is independent of
$(p, i, j)$. This concludes the proof of Theorem 2. □

We next provide more details on the computation of the happens-before relation 
induced by chronological traces. 

The happens-before relation $\rightarrow_\tau$ is computed using vector clocks, while taking the
particular structure of chronological traces into account. The main difference from
computing happens-before relations for sequentially consistent executions (see, e.g., [27])
is that load events which get their value by store forwarding are not immediately
synchronized with the vector clock of the memory location. Instead, the load is associated
with the store buffer entry from which it got its value. The load is then synchronized
with the memory location at the time when the store buffer entry is updated to memory.

Formally, we introduce auxiliary configurations, and define the semantics of
instructions over them. When exploring an execution, all instructions will be applied
simultaneously to the TSO system configuration (as described in Section A.2) and
the auxiliary configuration. Below we need vector clocks. A vector clock is a
function $C : (\mathsf{TID} \cup \mathsf{AuxTID}) \to \mathbb{N}$. The intuition is that $C$ captures a set
of observed events. For every thread $p$, the first $C(p)$ events by $p$ have been observed.
We let $\mathsf{VecClocks} = ((\mathsf{TID} \cup \mathsf{AuxTID}) \to \mathbb{N})$ denote the set of vector clocks.

An auxiliary configuration is a triple $(\mathcal{C}, \mathcal{B}, \mathcal{M})$, where

$\mathcal{C} : (\mathsf{TID} \cup \mathsf{AuxTID} \cup \mathsf{Event} \cup \{\bot\}) \to \mathsf{VecClocks}$

maps each (real or auxiliary) thread identifier $p$ to a vector clock representing which
parts of the execution have been seen by $p$. Also, $\mathcal{C}$ maps each event $e$ to the value
of $\mathcal{C}(\mathrm{tid}(e))$ at the time immediately after executing $e$. We fix that
$\mathcal{C}(\bot) = (\lambda x. 0)$ is a zeroed clock.

$\mathcal{B} : \mathsf{TID} \to (\mathsf{MemLoc} \times \mathsf{Event} \times (\mathsf{Event} \cup \{\bot\}))^*$

maps each real (not auxiliary) thread ID $p$ to a word of letters $(x, e_s, e_l)$, each of
which keeps auxiliary state for the corresponding letter in the store buffer $B(p)$.
Here $x$ is the accessed memory location, $e_s$ is the store event that produced that
letter, and $e_l$ is the latest buffer forwarded load event for which the letter has been
the source (if there is no such event then $e_l = \bot$).

$\mathcal{M} : \mathsf{MemLoc} \to ((\mathsf{Event} \cup \{\bot\}) \times 2^{\mathsf{Event}})$

maps each memory location $x$ to a pair $(e_u, E)$, where $e_u$ is the latest update event
that accessed $x$ (or $\bot$ if $x$ has never been updated), and where $E$ is a set which for
each thread $p$ that has read $x$ contains the latest event of $p$ that read the value of $x$.

Initially all clocks in $\mathcal{C}$ are zeroed, all buffers in $\mathcal{B}$ are empty, and for all
memory locations $x$ we have $\mathcal{M}(x) = (\bot, \emptyset)$.
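As a rough illustration, the triple of clocks, buffers and memory annotations could be held in a small Python record. This is only a sketch under assumptions: the class name `AuxConfig`, the field and method names, and the use of `None` for ⊥ are illustrative choices that do not come from the paper, and a missing dictionary entry is read as a zeroed clock, an empty buffer, or the pair (⊥, ∅).

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, List, Optional, Set, Tuple

Event = Tuple[Hashable, str, int]                   # (thread, instruction, index)
Clock = Dict[Hashable, int]                         # thread -> number of observed events
BufferEntry = Tuple[str, Event, Optional[Event]]    # (x, e_s, e_l); None stands for ⊥
MemoryEntry = Tuple[Optional[Event], Set[Event]]    # (e_u, E)


@dataclass
class AuxConfig:
    """Hypothetical container for an auxiliary configuration (C, B, M)."""
    clocks: Dict[Hashable, Clock] = field(default_factory=dict)               # C
    buffers: Dict[Hashable, List[BufferEntry]] = field(default_factory=dict)  # B
    memory: Dict[str, MemoryEntry] = field(default_factory=dict)              # M

    def clock_of(self, key: Optional[Hashable]) -> Clock:
        """C(key); unseen threads, unseen events and ⊥ (None) map to a zeroed clock."""
        return dict(self.clocks.get(key, {}))

    def buffer_of(self, thread: Hashable) -> List[BufferEntry]:
        """B(thread); a thread that has not stored anything has an empty word."""
        return list(self.buffers.get(thread, []))

    def memory_of(self, loc: str) -> MemoryEntry:
        """M(loc); a location that was never accessed maps to (⊥, ∅)."""
        e_u, readers = self.memory.get(loc, (None, set()))
        return e_u, set(readers)


initial = AuxConfig()
print(initial.clock_of("p"), initial.buffer_of("p"), initial.memory_of("x"))
# prints: {} [] (None, set())
```

Representing a zeroed clock by an empty dictionary keeps the initial configuration trivial to build; a real implementation would more likely use dense arrays indexed by thread.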

The idea here is that as we execute memory accesses, we update the vector clock of 
the executing thread to reflect which new events have been observed. 

For example, when we execute an update $e_x$ which corresponds to a buffer entry
$(x, e_s, e_l)$, we look to the memory $\mathcal{M}(x) = (e_u, E)$. We know that the update event is
ordered after the previous update $e_u$, as well as the previous loads in $E$ and the store
event $e_s$ which enabled the update $e_x$. We update the vector clock $\mathcal{C}(\mathrm{tid}(e_x))$ of the
auxiliary thread to include all these newly observed events.

The procedure for a load from memory is similar, except that we do not observe
previous loads. More interesting are loads that are satisfied by buffer forwarding. When
we execute a buffer forwarded load $e_l$ to $x$, we do not observe any new event, since the
load was not able to reach and synchronize with the memory. Instead we save the load
event with the buffer entry from which it read its value. When that entry is updated to
memory, by the update event $\mathrm{upd}_{ld}(e_l)$, we move $e_l$ to the set of loads that have been
observed by $\mathcal{M}(x)$. By this scheme the load event $e_l$ becomes available for observation
by precisely the update events which succeed $\mathrm{upd}_{ld}(e_l)$. In the remainder of this section
we will make this intuition formal.

We will also need some notation for dealing with vector clocks. For two vector
clocks $v, v'$ we use the notation $v + v'$ to denote the vector clock $v''$ such that
$v''(p) = \max(v(p), v'(p))$ for all $p$. For two vector clocks $v, v'$ we say that $v \leq v'$ when
$v(p) \leq v'(p)$ for all $p$. We say that $v < v'$ if at least one of the inequalities is strict.
For an event $e$ and a set $E$ of events we define
$E \oplus e = \{e' \in E \mid \mathrm{tid}(e') \neq \mathrm{tid}(e)\} \cup \{e\}$, i.e., $E \oplus e$ is $E$ where $e$
replaces the previous event $e' \in E$ s.t. $\mathrm{tid}(e') = \mathrm{tid}(e)$. We use the
shorthand $f[x_0, x_1, \ldots, x_n \mapsto v]$ to denote $f[x_0 \mapsto v][x_1 \mapsto v] \cdots [x_n \mapsto v]$, i.e., an
assignment of the same value to multiple function arguments.
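The vector clock operations just introduced are small enough to write out directly. The following Python sketch is only an illustration: the function names `join`, `leq`, `lt` and `replace_event` are hypothetical, clocks are modelled as dictionaries in which a missing thread counts as 0, and events are tuples whose first component is the thread identifier.

```python
from typing import Dict, Hashable, Set, Tuple

Clock = Dict[Hashable, int]          # thread id -> number of observed events
Event = Tuple[Hashable, str, int]    # (thread id, instruction, index)


def join(v: Clock, w: Clock) -> Clock:
    """v + w in the notation above: the pointwise maximum."""
    return {p: max(v.get(p, 0), w.get(p, 0)) for p in set(v) | set(w)}


def leq(v: Clock, w: Clock) -> bool:
    """v <= w: every component of v is bounded by the same component of w."""
    return all(n <= w.get(p, 0) for p, n in v.items())


def lt(v: Clock, w: Clock) -> bool:
    """v < w: v <= w and at least one component is strictly smaller."""
    return leq(v, w) and any(v.get(p, 0) < n for p, n in w.items())


def replace_event(events: Set[Event], e: Event) -> Set[Event]:
    """E (+) e: drop the event of tid(e) from E, if there is one, then add e."""
    return {x for x in events if x[0] != e[0]} | {e}


v = {"p": 2, "q": 0}
w = {"p": 1, "q": 3}
assert join(v, w) == {"p": 2, "q": 3}
assert leq(v, join(v, w)) and not leq(v, w)
```

Two clocks such as `v` and `w` above are incomparable, which is exactly the situation in which two events are unordered by the happens-before relation.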

For two arbitrary auxiliary configurations $c = (\mathcal{C}, \mathcal{B}, \mathcal{M})$ and
$c' = (\mathcal{C}', \mathcal{B}', \mathcal{M}')$ we now define the transition relation $c \to c'$ depending on the
next instruction $op$ of $p$. We let $j = \mathcal{C}(p)(p) + 1$ be the index of the next event for $p$
and $C_p = \mathcal{C}(p)[p \mapsto j]$ be the corresponding clock and $e = (p, op, j)$ be the event itself.

At the same time as we compute the next auxiliary configuration, we also compute
the set $R(e)$ of races $(e_r, e'_r)$ such that $e'_r = e$ and $e_r$ is some earlier event. Recall that
the races are the pairs of events from different threads related by $\rightarrow^{uu}_{\tau}$,
$\rightarrow^{src\text{-}ct}_{\tau}$, or $\rightarrow^{cf\text{-}ct}_{\tau}$.


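Store: If $op = st(x)$, then $c \to c'$ iff $\mathcal{C}' = \mathcal{C}[p, e \mapsto C_p]$, and $\mathcal{M}' = \mathcal{M}$, and
$\mathcal{B}' = \mathcal{B}[p \mapsto \mathcal{B}(p) \cdot (x, e, \bot)]$. There are no races: $R(e) = \emptyset$.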


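Load from memory: If $op = ld(x)$ and there is no letter of the form $(x, \_, \_)$ in $\mathcal{B}(p)$,
then $c \to c'$ iff $\mathcal{C}' = \mathcal{C}[p, e \mapsto C'_p]$, and $\mathcal{B}' = \mathcal{B}$ and
$\mathcal{M}' = \mathcal{M}[x \mapsto (e_u, E \oplus e)]$, where $\mathcal{M}(x) = (e_u, E)$. Here
$C'_p = C_p + \mathcal{C}(e_u)$ if $e_u \neq \bot$ and $\mathrm{tid}(e_u) \neq \mathrm{upd}(p)$, and $C'_p = C_p$
otherwise. If $\mathcal{C}(e_u) \not\leq C_p$ and $\mathrm{tid}(e_u) \neq \mathrm{upd}(p)$, then we have
$R(e) = \{(e_u, e)\}$. Otherwise $R(e) = \emptyset$.

Intuitively, $e$ is ordered after the last update $e_u$ to $x$, provided that $e_u$ originated in a
different thread.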


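Load from buffer: If $op = ld(x)$ and $\mathcal{B}(p) = B_0 \cdot (x, e_u, e_l) \cdot B_1$ for some
$B_0, B_1, e_u, e_l$ with no letters of the form $(x, \_, \_)$ in $B_1$, then $c \to c'$ iff
$\mathcal{C}' = \mathcal{C}[p, e \mapsto C_p]$ and $\mathcal{B}' = \mathcal{B}[p \mapsto B_0 \cdot (x, e_u, e) \cdot B_1]$ and
$\mathcal{M}' = \mathcal{M}$.

Notice that $e$ replaces $e_l$ in the store buffer entry. There are no races: $R(e) = \emptyset$.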


Fence: If $op = fence$ then $c \to c'$ iff $\mathcal{B}' = \mathcal{B}$ and $\mathcal{M}' = \mathcal{M}$ and
$\mathcal{C}' = \mathcal{C}[p, e \mapsto C_p + \mathcal{C}(\mathrm{upd}(p))]$. There are no races: $R(e) = \emptyset$.

Notice that the semantics of the fence, as defined in Section A.2, guarantee that
$B(p) = \varepsilon$. Hence the vector clock of the auxiliary thread $\mathrm{upd}(p)$ includes the clocks
of all updates of $\mathrm{upd}(p)$ corresponding to earlier stores. So $e$ will be ordered after all
updates of $\mathrm{upd}(p)$, as prescribed by $\rightarrow^{uf}_{\tau}$ for chronological traces.


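Update: If $op = u(x)$ then $c \to c'$ iff $\mathcal{B} = (x, e_s, e_r) \cdot \mathcal{B}'$ and

$\mathcal{C}' = \mathcal{C}[p, e \mapsto C_p + \mathcal{C}(e_s) + \mathcal{C}(e_u) + \sum_{e' \in E \text{ s.t. } \mathrm{upd}(\mathrm{tid}(e')) \neq p} \mathcal{C}(e')]$

where $\mathcal{M}(x) = (e_u, E)$ and $\mathcal{M}' = \mathcal{M}[x \mapsto (e, E')]$. Here $E' = E$ if $e_r = \bot$, and
$E' = E \oplus e_r$ otherwise. There is a race with every previous access to $x$ from a
different thread:

$R(e) = \{ (e', e) \mid e' \in E \cup \{e_u\} \wedge e' \neq \bot \wedge \mathrm{tid}(e') \neq p \wedge \mathrm{upd}(\mathrm{tid}(e')) \neq p \wedge \mathcal{C}(e') \not\leq C_p + \mathcal{C}(e_s) \}$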
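The five transition rules can be exercised with a compact Python sketch. It is an illustration under assumptions, not the implementation used in the paper: the class `Aux`, its method names, the tuple `("upd", p)` for the auxiliary thread of `p`, and `None` for ⊥ are all hypothetical, and the update rule is read as removing the oldest entry of the buffer of the real thread that owns the auxiliary thread.

```python
from collections import defaultdict


def upd(p):
    # Identifier of the auxiliary update thread of p (representation assumed).
    return ("upd", p)


def join(*clocks):
    # v + v': pointwise maximum of vector clocks.
    out = {}
    for c in clocks:
        for t, n in c.items():
            out[t] = max(out.get(t, 0), n)
    return out


def leq(v, w):
    return all(n <= w.get(t, 0) for t, n in v.items())


class Aux:
    def __init__(self):
        self.C = defaultdict(dict)   # thread or event -> vector clock
        self.B = defaultdict(list)   # real thread -> [(x, e_s, e_l)], oldest first
        self.M = {}                  # x -> (e_u, E); absent means (None, set())

    def _next(self, p, op):
        j = self.C[p].get(p, 0) + 1
        return (p, op, j), join(self.C[p], {p: j})   # the event e and the clock C_p

    def _commit(self, p, e, clock):
        self.C[p] = clock
        self.C[e] = dict(clock)

    def store(self, p, x):
        e, cp = self._next(p, ("st", x))
        self.B[p].append((x, e, None))
        self._commit(p, e, cp)
        return set()

    def load(self, p, x):
        e, cp = self._next(p, ("ld", x))
        hits = [i for i, entry in enumerate(self.B[p]) if entry[0] == x]
        races = set()
        if hits:   # load from buffer: remember e in the newest entry for x
            i = hits[-1]
            self.B[p][i] = (x, self.B[p][i][1], e)
        else:      # load from memory: synchronize with the last update of x
            e_u, E = self.M.get(x, (None, set()))
            if e_u is not None and e_u[0] != upd(p):
                if not leq(self.C[e_u], cp):
                    races.add((e_u, e))
                cp = join(cp, self.C[e_u])
            self.M[x] = (e_u, {r for r in E if r[0] != p} | {e})
        self._commit(p, e, cp)
        return races

    def fence(self, p):
        assert not self.B[p], "a fence requires an empty store buffer"
        e, cp = self._next(p, ("fence",))
        self._commit(p, e, join(cp, self.C[upd(p)]))
        return set()

    def update(self, q):
        # Performed by the auxiliary thread upd(q) on q's oldest buffer entry.
        p = upd(q)
        x, e_s, e_r = self.B[q].pop(0)
        e, cp = self._next(p, ("u", x))
        e_u, E = self.M.get(x, (None, set()))
        base = join(cp, self.C[e_s])
        seen = [a for a in E | {e_u} if a is not None and upd(a[0]) != p]
        races = {(a, e) for a in seen if a[0] != p and not leq(self.C[a], base)}
        new_E = E if e_r is None else {r for r in E if r[0] != e_r[0]} | {e_r}
        self.M[x] = (e, new_E)
        self._commit(p, e, join(base, *(self.C[a] for a in seen)))
        return races


aux = Aux()
aux.store("p", "x")
aux.load("q", "x")        # reads the initial value: there is no update to race with
print(aux.update("p"))    # the update races with q's earlier load of x
```

Each method returns the set R(e) for the event it executes, so a DPOR driver could collect the races while replaying a schedule.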



C Adaptation for PSO 


In this appendix, we show how our techniques can be adapted to the PSO memory 
model with minor changes. Before we see how to apply our methods to it, we give an 
informal description of the PSO memory model. 

C.1 PSO Semantics

PSO is a strictly more relaxed model than TSO. As described previously, TSO allows 
reordering of stores with subsequent loads. PSO allows the same reordering, but also 
allows the reordering of stores with subsequent stores to different memory locations. 

This behavior can be explained by an operational semantics similar to the one
described in Section A.2 for TSO, but where each thread has a separate store buffer for
each memory location. Each store buffer is FIFO-ordered, so stores to the same
memory location by the same thread cannot be reordered. But there is no order maintained
between stores in different buffers, so stores by the same thread to different locations
may update in reversed order.

In Figure 11(a) we give an example of a program where PSO allows more behaviors 
than TSO. The execution in Figure 11(b) shows how the stores by p to x and y update 
to memory in reversed order. This allows the thread q to read first y = 1 then x = 0, 
which would be impossible both under SC and TSO. 
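The reordering in this example can be replayed with a toy model of the buffer semantics described above. The Python sketch below is only illustrative: the class `PSOMachine` and its method names are hypothetical, the schedule is fixed by hand to follow the execution of Figure 11(b), and all locations are assumed to start at 0.

```python
from collections import defaultdict, deque


class PSOMachine:
    """Toy PSO store-buffer model: one FIFO buffer per (thread, location) pair."""

    def __init__(self):
        self.memory = defaultdict(int)      # every location initially holds 0
        self.buffers = defaultdict(deque)   # (thread, location) -> pending values

    def store(self, thread, loc, value):
        # A store only enters the thread's buffer for that location.
        self.buffers[(thread, loc)].append(value)

    def update(self, thread, loc):
        # The auxiliary thread upd(thread, loc) flushes the oldest pending store.
        self.memory[loc] = self.buffers[(thread, loc)].popleft()

    def load(self, thread, loc):
        # Forward from the thread's own buffer if possible, else read memory.
        buf = self.buffers[(thread, loc)]
        return buf[-1] if buf else self.memory[loc]


m = PSOMachine()
m.store("p", "x", 1)
m.store("p", "y", 1)
m.update("p", "y")       # the store to y reaches memory first
r = m.load("q", "y")
s = m.load("q", "x")
m.update("p", "x")       # the store to x reaches memory last
print(r, s)              # prints: 1 0
```

With a single FIFO buffer per thread, as under TSO, the update to `y` could not overtake the update to `x`, so this outcome would not be reachable.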

In the operational semantics for PSO we introduce one auxiliary thread upd(p, x)
for each pair of a thread p and a memory location x. Each such auxiliary thread is
responsible for the updates to x by p, similarly to how upd(p) under TSO is responsible
for all updates of p.

C.2 Chronological Traces for PSO 

The adaptation of chronological traces to PSO is straightforward. The following simple 
adjustment suffices: Since stores from the same thread p to different memory locations 
x and y are updated by different auxiliary threads upd(p, x) and upd(p,y), there is no 
program order edge between the update events for different memory locations under 
PSO. 

A chronological trace for PSO is illustrated in Figure 11(c). Notice that there is no 
program order edge from (upd(p, x), u(x), 1) to (upd(p, y), u (y) , 1). Had there been 
one, the trace would be cyclic. 
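The cyclicity claim can be checked mechanically. The sketch below builds the trace as a directed graph and tests it for cycles with Python's standard `graphlib` module. The edge list is an assumption: it is reconstructed from the execution in Figure 11(b) and the edge kinds discussed in the text (program order, store to update, source, conflict), since the drawing of Figure 11(c) is not reproduced here.

```python
from graphlib import CycleError, TopologicalSorter

st_x, st_y = ("p", "st(x)", 1), ("p", "st(y)", 2)
u_x, u_y = ("upd(p,x)", "u(x)", 1), ("upd(p,y)", "u(y)", 1)
ld_y, ld_x = ("q", "ld(y)", 1), ("q", "ld(x)", 2)

# Assumed edges of the chronological trace under PSO.
edges = [
    (st_x, st_y),   # program order in p
    (ld_y, ld_x),   # program order in q
    (st_x, u_x),    # store to its update
    (st_y, u_y),    # store to its update
    (u_y, ld_y),    # source: q reads y = 1 from this update
    (ld_x, u_x),    # conflict: q reads x = 0 before x is updated
]


def is_acyclic(edge_list):
    """Return True iff the directed graph given by edge_list has no cycle."""
    preds = {}
    for a, b in edge_list:
        preds.setdefault(a, set())
        preds.setdefault(b, set()).add(a)
    try:
        tuple(TopologicalSorter(preds).static_order())
        return True
    except CycleError:
        return False


print(is_acyclic(edges))                   # prints: True
print(is_acyclic(edges + [(u_x, u_y)]))    # prints: False
```

Adding the program order edge between the two updates closes the cycle u(x), u(y), ld(y), ld(x), u(x), which is why the edge must be absent under PSO.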

D Chronological Traces Versus Xtop Objects 

In this appendix, we compare chronological traces and Xtop objects [4].

Our goal was to design a mechanism for representing executions under relaxed
memory models that would allow a stateless model checking technique to explore no
more than one execution per Shasha-Snir trace. Chronological traces allowed us
to achieve this goal, since they are acyclic (in contrast to vanilla Shasha-Snir traces) but
still correspond one-to-one to Shasha-Snir traces. This is not the case for Xtop objects [4],
since two Xtop objects may map to the same Shasha-Snir trace. In fact, chronological
traces and Xtop objects are different. This is because an Xtop object includes
information about which event pairs are reordered (delayed), while a chronological trace does
not. As an illustrative example, consider the SB idiom. A chronological trace is given in
Fig. 4, and a corresponding Xtop object is given in [4, Fig. 4(b)]. Notice that the edges
$f(b) \rightarrow^{de} f(a)$ and $f(c) \rightarrow^{se} f(d)$ in the Xtop object specify that a and b are
reordered, while c and d execute in program order. This choice is not imposed by the
chronological trace. Indeed, there is a different Xtop object where c and d are reordered
instead of a and b. Both of these Xtop objects correspond to the same chronological
trace and to the same Shasha-Snir trace (Fig. 3). Hence a POR technique that explores
precisely all Xtop objects would unnecessarily explore more objects than our technique.


(a) A small program illustrating the idiom of Peterson's mutual exclusion algorithm.

    p                     q
    store: x := 1         store: y := 1
    store: z := 1         store: z := 0
    load:  $r := z        load:  $r := z
    load:  $s := y        load:  $s := x

(b) The chronological trace $T_C(\tau)$ corresponding to the execution in Figure 10(c).
Notice that there is no edge between (q, ld(z), 3) and either of the updates to z.

(c) An execution $\tau$.

    (p, st(x), 1)
    (q, st(y), 1)
    (p, st(z), 2)
    (p, ld(z), 3)
    (p, ld(y), 4)
    (upd(q), u(y), 1)
    (q, st(z), 2)
    (q, ld(z), 3)
    (upd(p), u(x), 1)
    (q, ld(x), 4)
    (upd(p), u(z), 2)
    (upd(q), u(z), 2)

(d) The Shasha-Snir trace $T(\tau)$ corresponding to the execution in Figure 10(c).

Fig. 10: Traces illustrated by the idiom of Peterson's mutual exclusion algorithm.




















(a) The mp idiom. Possible under PSO: $r = 1, $s = 0.

    p                     q
    store: x := 1         load: $r := y
    store: y := 1         load: $s := x

(b) An execution $\tau$ where finally $r = 1, $s = 0.

    (p, st(x), 1)
    (p, st(y), 2)
    (upd(p, y), u(y), 1)
    (q, ld(y), 1)
    (q, ld(x), 2)
    (upd(p, x), u(x), 1)

(c) The chronological trace of $\tau$ under PSO, drawn over the threads p, upd(p, x),
upd(p, y) and q.

Fig. 11: A behavior allowed under PSO but not under TSO.




